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SgCTION 1f INTRODUCTION 



p ' •• Realtime f light*.training» simulators generally use 

several single-ins£ruction single-data path (SISD) computers 
to attain the* required processing capability . This is. 

% similar to the capability offered on a smaller scale by a s 
multiple-instruction multiple-data path (MIMD) computer. 

'until recently, however, a ^practical functioning MIMD 
puter had not been implemented — all predictions of 
increased speed, and fidelity with MIMD architecture were 
purely theoretical. Even though an operating MIMD computer 
now exists, there are still problems obtaining the' maximum 
efficiency from the software. Because the trend is toward 
more parallel computer processing ancf parallel processing 
configurations, the Air Force sponsored this study to 
* develop the technology needed to "take advantage of 'the • 

benefits offered by MIMD architecture. The purposes of this 
study • were to det,errtiine which software techniques are most 
practical to implement, and to determine the implications of m [ 

- using an MIMD computer in real-time simulation. . 



Computer Architecture 



'The machine used in the study was D^nelcor, Inc.'s 
Heterogeneous ^Element Processor (HEP). HEP is an MIMD 
machine of the shared resource type as defined by Flynn 1 . 
In this type of organization, skeleton processors compete 
for 'execution resources in either space or time. For 
example, the set of "per ipheraft processor^ of the CDC 6600 
may be viewed as an MIMD machine implemented by the time- 
multiplexing of ten prb^ess states to one functional unit. 



1m. J, Flynn. "Very High Spe^d Computifi^ Systems". 
Proceedings IEEE , 54 (1966), pp. 1901-1909. 
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In a HEP processor, two queues/ time-multiplex the 'pro- 
cess states.' One queue provides input to a pipeline that 
fetches a bhree-address instruction, depodes it, obtains the 
two operands, and sends the information to" one of several* 
-pipelined function units that complete the operation. If 
the operation is a data memory access, the process state 
enters a second queue. This queue provides input to a pipe- 
lined switch that interconnects/ several data memory modules 
with several processors." When /the memory access is com- 
plete, the process state is returned to the first queue. 
Figure 1 'shows the processor organ i-zat ion , and Figure 2 
* Shows, the system layput. 1 



Each HEP processor suppqrts up, to 128 processes, and 
nominally begins executing a/ new instruction (on behalf of 
"some process) every 100 nano/seconds ( ( ns) . The time required 
to complete an ' instruction is 800 ns.. Thus. if at least 
,eight" independent processed/ (processes that do not shajre 
data) are executing in* one processor, the instruction execu- 
tion rate. is Ijo'' instructions per second per processor. 



HEP ihstruction6 and data words are 64 bits wide. The 
floating-point 'format is sign magnitude with a hexadecimal, 
seven-bit ,/ £5ccess-64 exponent. All function units, except 
the divider, ^xecute one instruction every 100 ns. The 
divider can support this rate momentarily but is slower on 
the average. > / 



Tasks 



Since HEP attains maximum speed wh^n all of its pro- 
cesses are' independent at simple set of protection mecha- 
nisms is ^Incorporated to allow potentially hostile users to^ 
execute simultaneously. A d6main of protection ih HEP is 
called a task, and/ consists of a se-t of processes, with the 
game task identifier (TID) in their process states. The * TlD 
-specifies a task/status word that contains base and limit 
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Figure 1 - Processor Organization 
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PROCESSOR 
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PROCESSOR 




addresses defining the. regions within the various .memories 
accessible by the processes in that task. In this way, prb- 
cesses within a task may cooperate but are prevented from 
communicating with those .in other tasks. Processes in dif- 
ferent tasks' or processors may communicate via data memory 
if ' they have overlapping allocations there. 

Processesare^a scarce resource in HEP. In addition, 
'the synchronization -primitives used in.HEP make processes 
"difficult to virtualize. 'As a result,/ the maximum number of 



processes a task. uses must be specified to the system when 
the task. is loaded. The operating system insures that the 
total allocation of processes, to tasks does, not exceed the 
number available. A create fault (too many processes) can 
occur only when one or more tasks-have created more pro- 
cesses than they were allocated. In this event, the offend- 
ing task or tasks (not necessarily the task that actually 
caused- the create fault) are removed from the processor. 

Protection violations, create faults,' and other error 
conditions arising within a process cause traps. ■ A trap is 
the creation of a process executing in a. supervisor task. 
Sixteen tasks are avail-able in" each . processor ; eight are • 
user tasks-and the other' eight are corresponding supervisor 
tasks. When a process causes a trap,' the entire task Is 
made dormant" to prevent farther execution' by an^ process .m 
it A process is created in. the fcorrespondirig supervisor 
task to handle the condition.- This scheme is not used for 
'creatV fault, hdwever^a create fault/suspends ^execution of 
; aip. processes., regardless of task, except those actually 
handling the fault. • ' 1 

'* Create fault occurs before all pro^ssear have been 
used. This allows any* create instructions in* progress to 
complete normally, and allows" for the creation of 'the create 
' fault handler process. Ail- other traps in; HEP are precise 
in the sense that they prevent the 'execution of any subse- 
quent instructions in the offending task. 
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Synchronization • ■ 



Any register or .data memory location in HEP can be. used 
^to synchronise two processes in a producer-consumer fashion. 
This requires three access states: a reserved state to pro- 
vide for. mutual exclusion, a full state, and an empty state. 
When an instruction executes, it tests the states of loca- 
tions and modifies them indivisibly. Typically an instruc- 
tion test's its sources -full and its. destination empty. If a 
test -fails, the process reattempts the instruction on its 
next turn for servicing. If all tests Succeed, the process 
executes the instruction and e sets both sources empty and- the 
; destination reserved, tfhe operands from the sources are 

sent to the function unit, and the program oounter in the 
vProcess state is incremented. When the function .unit 
eventually writes a result in the destination, it sets the 
destination, full,. * . ' 

" A des-tinatlo/i may be tested full 'rather tftarT'empty ,^-to 
» preserve the state of a .source or to override the state bf a 
source or destination. A reserved state, however, may not 
be overbidden except by certain privileged instructions. 

Input-output synchronization -is handled' naturally by 
mapping I/O device registers into data memory address space » 
(an interrupt handler i-s just a process that is .attempting 
to read an input location or write an output location) . I/O 
■device addresses are not relocated by the data memory base 
•address. All I/O-addressed operations are privileged. 



Switch 



, 4 * 

. 4 The switch that interconnects processors and data 
memories to allow memory sharing consists of a number of X 
nodes connected by ports. Each node has three ports and catf 
simultaneously v send and receive^ message on each port. The 



messages contain the address of the .recipient, the address • 
of the originator, topper at ion to be performed by the 
recipient, and a priority. Each switch. node receives a 
message on each port every 100 ns. The node attempts to 
-retransmit each message on a port that reduces the distance 
of that message from its recipient; for this purpose, each 
node has a table that maps the recipient address into the 
number of a port that reduces distance. If there is 
conflict for a port, the node routes one message correctly 
and the rest incorrectly. To help insure fairness, an 
incorrectly routed message has its priority incremented as 
it passes through the node. Preference is given in 
conflicts to the message with the highest priority. 

The success or failure of the operation (based on the 
access state of the memory location) must be reported back 
to the processor so it can decide whether to reattempt the 
operation. Thus, the time required to complete a memory 
operation via the switch includes two message transmission 
times, one in each direction. t 

The propagation delay through a node and its associated 
wiring is 50 ns. Since a message is distributed' among two 
or three nodes at any instant, the switch is fcwo-colorable 
to avoid conflicts between the beginning, of one message and 
the middle of another.' When the switch fills -up. due to .a 
high conflict rate, misrouted messages begin to "leak". 
Every originator is obliged to reinsert a leaking message 
before inserting a new miss age\ Special measures are taken 
when the priority reaches its maximum value. This avoids ^ 
indefinite delays for such messages. A preferable scheme 
would have been to establish "priprity by' time .of ' message , 
creation, but this would have required too many bijjfs. 

FORTRAN Extensions ■ m * 



Two extensions ta FORTRAN allow parallelism' in source 
programs. First, subroutines, may execute in parallel with 



their callers, -either by being CREATEd instead of-CALLed or 
by executing a RESUME before a RETURN. Second, variables 
and arrays whose names begin with W $ B may be used to 
transmit data between two processes via the full-empty 
discipline. A simple program to add the elements of an 
array $A is shown in Figure 3. Thel subroutines INPUT and 
OUTPUT perform obvious- functions; the "subroutine ADD adds 
the^ elements. There are a total of 14 processes executing 
.as- a' result of running the program — the main Rrogram 
itself, the INPUT and OUTPUT subroutines and 11 copies / of . 
ADD . *' < " • ; 

As a parallel computer, HEP has an advantage over SIMD 
machines and jnore loosely coupled MIMD machines in solving ' 
large systems of ordinary differential equations that simu- 
late continuous systems .« In this application, vector opera- 
tions are difficult to apply because of the precedence con- 
straints in the equations, and loosely coupled MIMD organiz- 

. ations are hard ta use feecausp a good partition of the pro- 
blem to share workload and minimize communication is hard to 
find. Scheduling becomes relatively easier as the number of 

* processes increases, it is quite simple with orte process 
per instruction as in a data flow architecture. • 



Problem Selection 



The contractor principal .investigator and the. Air Force 
. dontract nfoftitor had a Urge number <?f programs and program 

segments to examine and select. These included tens of 
• thousands 7 of/ source lines provided by the Ait Force contract 
monitor, and several programs prpvided by the contractor * 
principal investigator. The contract ^monitor provided the 
singulation^ systerti for the T-38B and A-10 aircrafts. These 
programs are clearly most representative of current and - 
future simulation programs. m A complete program, however/- 
\ was too large for the scope of this study. Futher, th^ese 
programs supported a "map in the loop" and had inputs and 
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c add up the elements of 

» 

C THE AfcRAY $A 

REAL $A(1000,$S( 16) ;$SUM 

INTEGER I ' - , 

CREATE INPUT ( $A, 1 0T)0 ) 

DO 10 1=1,10 ' % 

CREATE ADD ($A( 1 0 0 * I -94) ) ,$S(I) ,T00) 

10 CONTINUE 

* CREATE ADD ( $S , $SUM , 1.0 ) 

CREATE OUTPUT($SUM,1 ) 
v END 

C NOELTS ELEMENTS OF "$V ', 

C ARE ADDED AND PLACED IN $ANS 

SUBROUTINE $ADD( $V,$ANS , NOELTS) 

REAL $V(1 ) ,$ANS,TEMP 
* INTEGER J, NOELTS 

TEMP=0.0 

DO 20 J=1 , NOELTS , 

TEMP=TEMP+$V( J) . 
20 CONTINUE 
$ANS=TEMP 
RETURN 

END * ^ . ' 
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«.*..-. . . - : ' * . • . 

.outputs external- to / the computer. Thus t , the T-38B and A-110.. 
simulation programs provided interesting -code 'segments for 
analysis, but could not be executed "on HEP. Pour subrou-1 
tines, however -.- constituting, the solution of the flight 
equations for< the A- 10 aircraft — were selected for the 
study. The sam6 subroutines irt the T-38B simulation used'- 
more than 50% of all of the CPU 'cycles used by the total 

.simulation. Thus it was felt that -the results gained from 
studying these subroutines could be extrapolated to the 
entire simulation. - " * - 

To include a complete program whose serial' and paral- 
lel versions could be executed on HEP, the contractor 

, furnighed a program that simulates the flight characteris- 
tics of a ground-launched missile. ' This program is a 

-sequential FORTRAN program of 442 source lines that solves a 
set of 10 nonlinear, first-order differential equations. 

The code supplied by the Air_JPcfce contract monitor 
included a library of ■ mathematical functions' that many of 
the modules invoke. Thus, one elementary function and two 
mathematical functions we're also included in the study. 
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SECTION 2: PARALLBLISH AT THE MACHINE INSTRUCTION BEVEL 



Elementary Functions 



A significant computational - task in' any scientific » 
computing activity is approximating elementary functions 
(SIN,, LOG, SQRT, etc.)** The extensive mathematical library 
in the listings supplied by the Air Force contract monitor^ 
indicates that this is the cake for flight simulation. 



Evaluating Polynomials, 



Since the very beginnings of electronic digital 
computing, the preferred ;method Of approximating elementary 
functions has been polynomials. Vfe have found no evidence 
that parallel computing -alters this choice. Thqs we 
concentrate on parallel methods of evaluating polynomials. 

The evaluation of a polynomial/ of degfee n, 



P n (x) = a 0 + ajx + ... + a n x" 
requires 2n operations 2 : Thus Horner's rule 



p rv = a n 



p i = < p i+1 x ^- + a i i = P~ 1 ' n ~ 2 ' * * * '° 
P n < x > = p 0 " 



2 F.. Winograd. "On The Number of Multiplications Required 
to Compute Certain Functions". Proceedings, National 
Academy of Science USA ,*Vol> 58 '( 1?67>, pp. f840-1842. 
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is optimal or SISD computers because jLt requires precisely 
2n operations and 2n time steps. But we know that, given n 
processor*, the lower bound for polynomial evaluation is 
[lo<j2n]+1 time steps^. From this, and future examples, it 
is clear that Horner 1 s f rule is no longer optimal for MIMD 
computing, where execution time is the criterion. 

To describe techniques for evaluating polynomials, we 
require the following* notational conventions: 

(n*,m) denotes a polynomial * of n terms in which the 
smallest is multiplied by x m , and 



(0,m) denotes the varia{?l£ x to the mth power. 



To analyze the performance of the algorithms, we assumed: 

(a.) a sufficient number of processors that execute 
arithmetic (,add, multiply) in one time step are 
available, 

(b) results of an operation are available to all 
processors in the next time gfcep, 



(c) processors suspend operations until, all operands 
r are available, and 

(d) there is no ope|ational overhead in assigning a 
process or performing an operation. \ . 

For HEP, assumptions a, b, and c present no problem so 
long as "sufficient" does not exceed the number available 
(for elementary f unct iorjf§,, this is the case). In general, 

s 



^Ian Munro. "Optimal Algorithms^ for Parallel Polynomial 
Evaluation" . Journal of Computer and System Sciences , 1 973 , 
pp. 189-198. 

♦ * 
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assumptions will not holdr as will be seen in the code for 
^elementary function, however, the assumptions do hold by use 
of pertain coding, practices. > 

A. straightforward method of evaluating a polynomial 
P n (x) is to decompose' it into two polynomials of lesser 
y degree. This method computes P n ( x ) as 



where 



and 



P n (x) s Q n/2 ( x )- xn/2+1 + R n /2' (X) 



, s n/2-1 
Q n / 2 (X) * < V2 + 1 



R n/2^ a n/2 xD/2 + • • • + a 0 



and then computes Q n / 2 ^ x ^ and R n/2* x * similarl Y by ^ . 
binary splitting. Thus it starts by computing in parallel 

a^+ag , a 3 x+a 2 , a 5 x+a 4 , ... , a n x+a nwl 



and then 

(AjX+a^x 2 + ajX+a^f (a 7 x+a 6 )x 2 + a 5 x+a 4 ,... • " 9 f 

\ The time required for this algorithm is approximately 
• * 2 log 2 n - 

This algorithm can be improved by performing the binary 
splitting in the Fibonacci ratio instead of in halves. Let 
F(i) denote the ith element of the Fibonacci sequence^ 

1,1,2,3,5,8,13,21,... 



and . for a polynomial of .degree n determine the least i such 
that F(i) £ n+1. * We then split the evaluation of the poly- 
nomial by: 
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(n+1,0) = [n+1-F(i-1),0] [0/F(i-1N + [F( i-t )-,«]\ < 

P 

The execution time for 'large .ri is » a* log n + v O(log' a) where 



a- 1/log[1/2('5 + !).]'« 1.44. x o 

/ . . X 

An example of the u^e of Fibonacci splitting to evaluate a 
polynomial of degree 20 -is shown in Figure 4. 

Improvements to the Fibonacci splitting -method have 
been reported ,(Hu«ro, 1973), but the improvements appear 
only for very large values of n. For elementary functions 
where the degree of the polynomials is generally % less* than 
20 a .discussion of these improvements does ^not seem *„ 
warranted.^Table 1 presents the largest degree polynpmial~- 
thit roay be evaluated in t steps using Fibonacci splitting 
versus using the best known algorithms. ^ 



- % t = '2 3 4 5 6(.7 8.9 .10^11 

Fibonacci • . 1 2 4 7 12 20 33 • 54 88 * 143 

Best Known 1 2 4, 7. J 2 21 37 63 107 187 



- ■ . * i 

Table 1 - Greatest Degree of; Polynomial Computable in Time t 




In addition to the splitting techniques^ ;>4^J|er}£rraliza- , ' 
tion Of Hbtrner's r 4 ule to make it amenable to parallel com- 
puting has been reported by /uorn^. If the exec^t ion^ime 



of an addition and a multiplication are the same, however, 



S. Dorn. "Generalization of Horner's Rule for 
Polynomial .Evaluation" ./ IBM Journal , April 1962, pp. 
239-245. ** ■ • 
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Figure 4 - Fibonacci Splitting <ff ?20^ 
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this method requires 2 log n stejps ffor the evaluation of an 
ntii ord^r polynomial. 



Specific Examples - - 

X 

The approximation of elementary functions by an MIMD 
computer requires not only techniques for parallel* evalua- 
tion of polynomials but also techniques fbt\ generating coef- 
ficients of "best" approximations. The lafter subject has 
received extensive attention in the literature and is not 
acjdeessed here^. j 

* ^^The specific elementary functions chosen to be included 



\ 



^The interested reader is referred to Computer 
Approximations by J. P. Hart, et. al. (New York: Wiley & 
J. Sons, Inc., 1968). , * 
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in this study were the approximation of cosine and logarithm 

(b"ase e ). The algorithms, use a 64-bit floating point word 

with an 8-bit exponent (Radix 16) and a 56-bit normalized, 
fraction. . M * * 



Cosine 

The cosine function accepts an argument (A) in the 
range -16^ £ A £ 16^ a$d produces a result in the range 
-1 £ cos(A) < 1. The method used converts -the argument into j 
the range^ 0 £ x k 2n by the relationships 

cos(x) = cos(-x) =*cos(x+2Kn) . - 

Next," the argument is reduced to the range 0 to n/2 by the' * 
relationships pictured in Figure 5. 



-COS (* -X ) 




3n/2 



cos(x) 



cos ( 2 ir- 



Pigure 5 Cos 3 ine Function Relationships 



Finally, the function is approximated by a 9th degree poly 

nomial in the converted argument x 2 . That is: , 

* <• ■ 

, _ cos(x) = Pg(x 2 ) . 



More concisely: 

cos(a) = 

cosCb) = 



COsfb) «e(-1,1) 
b > 0 



cos(c + 2k w) 



ke(0,1 ,2, , 
0 < C < 2, 



,10 12 ) 



COS(C) = 5COs(y) 6 €(-1,1) 

/ 0 < y <■ / 2 

6, y defined -0 < c < /2 y = c 

" " * 6 + 1 
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2.4 



S 



\/2 < c < ir y^—n-c 

tK C < 3 t/2 ' y = C- n 
6 - -1 



3 V2 £ c < 2 tt. y = 2*-c 



cos(y) = P 9 (Y2) 
with coefficients 



PO 




+.9999 


9999 


9999 


9999 


9999 


3632 


9000 


E+0 


P1 




-.4999 


9999 


9999 


9999 


9948 


3628 "4300 JB+0 


P2 




+.4166 


6666 


6.666 


6665 


9736 


7005 


4000 


E-1 


P3 




-.1388 


8888 


8888 


8853 0208 

* 


2298 




E-2 


P 4 - 




+.2480 


1587' 


3014 


9274 


6422 


2970 




E-4 


P 5 




-.2755 


7319 


2096 


6674 


8555 






E-6 


' P6 




+.2087 


6755 


6674 


2345 


8605 






E-8 


P7 




-.1 147 


Q670 


1991 


7777 


701 1 






E-10 


P8 




+.4776 


8729 


8095 


7170 


J 






E-1 3 


P9 




-.1511 


9893 


7468 


8700' 








E-15 



.This polynomial approximation has an *absolute accuracy 
of 20.19 digits 10 (16.77 digits 16 ). Scaling the argument 
into th^range [0,2.*-]' causes a lo'ss of JTog -| 5 ( A/2ir"fj 
digits-j 5.' Therefore, the machine word size pf 14 digits^ 

should ^determine accuracy. ' % \ 

f " \ 

The approximation was programmed, and its accuracy . 
tested, with- 50 uniformly distributee? argument values in the 
ranga 0 to 2. The results were compared with 11£ bit rou- 
tines. Statistically the results were as sjibwn in Table 2. 
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I 




magnitude 


base 2 log 


• maximum absolute error 

♦ > 


1.01 x 10 -15 


-49.8/- 


maximum relative 1 error 


1 .99 x 1 0-J 5 


-48.8 

• 


^^aye«ige ^relative error 

Std. deviation of relative error 


4.01 x 10 -16 . 
4.01 x 1 O"" 16 


-51 .2 



Table 2 - Accuracy of Cosine Approximation 

The algorithm comprises the followiijg tasks; / 

• ^T^ - Remove sign from argument'- ^ , - -w/ 

T* - Scal*e Magnitude of argument into 0 to 2 
j^j. - Select quadrant reduction 

. , /T4 " Per form * reduction and- save* quadrant . sign 

T5 -'Evaluate approximation 

* T g - Combine approx. value and quadrant sign 

1 T7 - Empty multiple* last uses variabres 

The tasks have the following precedence * graph; t 




ERJC 



s 

All tasks 'except T5 have no internal parallelism or are more 
efficiently processed sequentially. T 5 "*," Evaluate approxi- 
mation", has the computational tree shown in Figure 6.. ^ 

1 



width 




Figure *6: Cosine Task 5 Computational Tree 



This routine was- programmed? for HEP and resulted in the 



r 



following performance: 

atotal number of instructions executed 60 

Number of instruction cycles used 24 „ 
Maximum, number of concurrent processes 6 

Average number of concurrent processes 2.50 
Planned number of waved off instructions .1 
Storage: 

Total words of: 

Program Memory 69 

Register* Memory . * 25 

Constant Memory .29 



*Thus we have achieved a speed-up of 2.5 in evaluating the 
cosine^f unbtiop.- 
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Logarithm 

The second elementary function examine'd was the approx- 
imation of logarithm base e. The method breaks the input 
range of the argument into two different ranges. Range I is 
(2')~ 1 / 2 < a < (2) 1 / 2 , where the function uses a direct 
rational approximation of the form: 

log(a) = z[P 3 (z 2 )/Q 3 (z 2 )] 

• • .'.si " : V * ' ' 

a+1 

For Range II, (2)" 1 /2 < a Q r (2) 1 /2 < a, we can use 
the following relationships to convert to base 2 logarithm 
and extract a bounded value to approximate: 

lbg e (a) - log e (2) • log 2 (a) 

log 2 (a) = log 2 (f-2 2 ) = n + lpg 2 (f), 1/2 < f < 1 

' We now approximate log 2 (f), the result of which we * 
.combine with n, then multiply by log e (2) for the final 
result. 

\log 2 (f) =yR 6 (y 2 ) - V2 
. y = 1/2(1 - [(2) 1/2 /(f + (2)" V2 )1) . 



More concisely: 11 

Range I: , 
for (2) 1/2 < a < (2) 1/Z 

• log e (a) =. z[P 3 (z 2 )/Q 3 (z 2 )] 

•z = (a-1)7(a+1) 

Range II • . 

for a < (2) 1/2 or (2) 1/z < a 

log e (a) = log e (2).,. log 2 (a) 

- 23 - 
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16' 



-N 



n e { ,-64, ... ,63) 
N ' {0,1,2,3} 



\ 

1/2 £ f < 

log 2 (a) » (4n - N - 1/2( + (log 2 (f) + 1/2) 
(log 2 (f) ' + 1/2) yR 6 (y 2 ) 



1/2 



y ='1/2 - (2) ,/z / (2 • f + (2) 1/2 ) 
Execution time decreases by 

log e (a).» log e (2), • (4n - N - 1/2) + yR' 6 (y 2 ) 
where coef of R.' = log e (2) times, coef of R: 

* 

log e (a) = z(P 3 (z 2 )/Q 3 (z 2 )) 
P 0 = -24.01 3917 9559 2105 10E+0 
P! ■ +30.95 7292 8215 3765 01E+0 " 
P 2 ■ -9.637 6909 3368 6865 93E+0 

- P 3 * +.4210 8737 1217 9797 15E+0 
g 0 - -12.00 6958 9779 6052 *55E+0 
gi * +19.48 0966 0700 8897 31E+0 
g 2 ="-8.911 1090 2793 7831 23E+0 

*G 3 « +1 .0 E+0 » - , 

log 2 (f) • log e (2).* yR' 6 {yh-^ 
r 0 » +4.000 0000 0000 0000y 67E+0 
rj -,+5.333 3333 3332 4188 96+0 
r 2 ■ +12.80 0000 0198 2788 68E+0 

* r 3 » +36.57 1412 4660 5914 90E+Q. 
r 4 « +113.7 8399 8715 0066 '37E+0 
r 5 » +371.1 3591 8715 6528 26E+0* 
rg » +1379, 3999 4910 9060 60E+0 



These approximations provide 19.38 digits^ (16,09 
digits 16 ) of "absolute accuracy for Range I, and "1 7. 18 
digits 10 (14.27 digits 16 ) of" relative accuracy for Range II, 
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AS vras the case With the approximation of cosine, this 
approximation was split into several tasks to facilitate 
parallelism. _The tasks are as follows (tasks within 
brackets U apply only to Range II )* 




T1 
IT 2 1 
[T 3 ] 

T6 

T 7 
[T 8 ] 
[T 9 ] 

Tio 



Select range 

Extract fraction (1/16*to -1) 
Extract exponent 

Select fraction and exponent adjustment values 

Adjust fraction (1/2 to 1 ) , 

Form approx. argument 

Evaluate approximation *. 

Form result exponent 

Combine exponent and approx. value c 

Empty mutli pie last use variables 



This set of tasks has the following precedence graph: 

[T 3 ]> 





[Tc] > Tg— >T 7 . 



•10 



T 1f "Select range% has the following parallelism: 




~7 

Tg, "Form approximate argument", has. the following 
parallelism: 




T 7 , '^Evaluate* approximation", has the computational trees 
shown\in Figure 7 • 

The remaining tasks have no internal parallelism or, are more 
efficiently processed sequentially* . • 

The logarithm approximation was tested using two sample 
sets of 100 uniformly distributed values in the range 0 to 2 
and 0 to 10 6 . The results were compared against 112 bit 
routines; the statistics on the accuracy obtained are given 
in Table 3. 



Hange (0,2) 



magnitude base 2 log 



maximum absolute error 7.61 x lO"^ 

maximum, relative error 8.35 X 10 

average relative error 3.20 x 10 

Std. deviation of relative error ' 1.84 x 10 



-16 
-^16 
-16 



-50.2 
-50.1 
-51 .5 



Range (0,10 6 ) 



magnitude base 2 log 



maximum absolute error 
maximum relative error 
average relative terror 
Std. deviation of relative error 



"6.55 
4.52 
3.15 

Hi 



10~ 16 
10-17 

i<r 1 . 7 . 

10-18 



-50.4 
-54.3 
-54.8 




table 3 - Accuracy of Logarithm Approximation 
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Range* I 

numerator 



4 



denominator * 




z(P 3 (* 2 )/Q 3 U 2 )) 



total 17 
operations 



s 



Range II 



step 
1 



2 V V'V 



' ' v , v v r V 

♦ r» v * 7 r 



I V 



2 \ 2 



N * * * 



A 




* 

,4 ' 





y r 0 y r l y width 

4 
5 
5 
4 
2 




V 



%R 6 (y 2 ) 



/ 



l, 

21 



Figure 7 *- Logarithm Task 7 Co«putation4^Trees 



The .approximations f6r the two ranges performed as follows: 

0 

f » 

Range I: 

Total number of instructions executed 47 

j 

Number of instructicm cycles used .30 

Maximum number of concurrent pifbcess.es 4 

Average number of concurrent processes 1197 
Planned number of waved off instructions^ A 



f Range II: 



9. 




Total number of instructions executed -66 
Number of instructions executed 4 , 32 

Maximum number of concurrent processes 5 
Average number of concurrent processes 2,28 
Planned number of waved off instructions 4 



Storage: 

Total words of: 

Program Memory 103 
Register Memory 26 
Constant Memory , 69 

Assuming that arguments in -the two ranges are equally 
probable, the speed-up of this algorithm is 2.125. 



General Code Sequences 

This section examines the problems of generating paffal- 
•lelism at the machine instruction level for general code* 
sequences. * The basid techniques. consist of tree height 
reduction methods*. In some cases * optimal algorithms exist, 
in others only heuristic methods apply. 

- ■ >. 

a r • ' • * 
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Tree Height Reduction Techniques . 

The basic entity to which tree height reduction applies 
is expression evaluating. - From simple fan-in arguments it 
is clear that given an expression of n distinct atoms and 
involving the binary operations of„ addition, subtraction, 
multiplication and division, a lower bound on the tree / * 
height is [log 2 n] . In many cases, however, the tree^produc- 
ed by an ordinar^ compiler does not achieve this lower 
bound. Thus we consider associativity, commutat ivity , and 
distr ibutivity to reduce tree height. 

Consider the following expression and its tree, repre- 
sentation: 




(A + (B * C) ) + D* ■ ^7 

A 



By associativity and commutat ivity, we can reduce the 
expression and its tree height to: 



(B * C) + (A.+ D) 




+ 



The original expression could use only one processor 
for its evaluation and required three time steps, whereas 
the transformed expression can be evaluated tjy two (proces- 
sors in only, two time steps. If we restrict ourselves to 



- 29 - 34 



associativity and ccfmmutati vity , algorithms presented by 
4 Baer and Bovet 6 have been shown to be optimal. But distri- 
; bujtivity can also reduce tree heights. Consider: 



<4 



A+B(G+D*E*F 




This expression has a tree height of 6 and can be reduced* by 
associativity and commutatiyity to ct tree height of 5, By* 



also using th6 laws of distributivi 



y, hoyever> we produce: 



/ A+G + B*C + B*D*E*F 




c 



6j. L. Baer and'D. P. Bo^t. "(Compilation of Arithmetic 
Expressions for Parallel Computation" . Information 
PrQcessing '68 , North Holland Publishing Cofhpany, Amsterdam, 
1969, „ • - 
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V 




V- 



This has a tree height of 3. Using four processors would 
result in a spded-up of 2 over the original" expression.^. 
Unfortunately/ we cannot just distribute a multi-plication 
across & parenthesis and reduce tree height. For example^ * 



A * B * (C + D) 

/ 

has a tree height of 2. Using, the distributive law, we get, 
A ** B * C + A *• B * D 



which has a tree height of 3. There are-gopd algorithms 
that reduce tre£ height using distributivity , associativity, 
and commutativity , but they are not, necessarily optimal. 



We now consider multiple expressions, a^ 'would be the 
case with a set of assignmer^^^statements. 'Ccftisider:** 

* B 





This block of assignment statements Wis a tree heighjb of 4, ^^-3 
-which may be reduced to 3 tyy back substitution: 

A = B * C * D • 
E = P*B*C*D t . .. 

• G = E * B * C * C + H * 
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Observe, however, 'that considerably mor.e; operations have 
hetn introduced t/h achieve this reductions , 

In addition to arithmetic expression, linear recur- 
rences of ffr *a possibility for significant speed-ups. - . 
"Consider the* linear recurrence represented by the following 
. nested DO loops: 

• * * ; % 

DO • 3 I = 1 , 10 v - 
' DO 3 J = 2, VQ j 
3 A (I, J) = A (I,Vl ) + B( J) 

-* • . ■ 

The* .outer loops can be done simultaneously as: 

.. v> ' DO 31. J =2,10 , . s 

• 31 A(1,J) = A(1,J-1) + B(J) ' 

DO 32 J' = 2, 10- - A 
• 32 A(2,J) = A92,J-1) + B(,J0 ' 



Further, the interior of each loop is just: 

> * 

. ' A(I,J) » A(I,1) + B(-2) +•••• + B(J) 



This expression can be evaluated in logarithmic speed. 
Hence the total speed-up could be as high as 2K teut this is 
at .a cost of considerably more .code. Also*, the efficiency 
in this example is orily 50%. 

Another area thar^ has been stTudied i^conditional 

branches represented by IF statements. For ' an isolated -IF 

"statement, all instruction streams must be funneled through 

the branch, as with a JOIN statement. But a seption of code 



with sevetal IF statements and some assignment statements 



s^yeraj 

may be efcgressed as: , 

(a) a set of assignment statements all of*>which may be 
executed simultaneously, C 



(bj a set of Boolean functions all of which may be 
evaluated simultaneously, - 

(c) a binary ^decision tree through .which one path will 
be' followed for each execution o'f the program seg- 
ment, and t 4 

(d) a collection of sets of assignment statements with 
a single variable on the right where e^ach set is 
associated with each 'path through the, tree. 

These techniques have been written as a "PL/1 program and 
applied to a set of 86 FORTRAN programs 7 . Averaged over 
the v 86 programs, these techniques could, use 35 processors, 
resulting ii\ a speed-up of 9.? and an-ef f iciency of 33%. 

- * 
Specific Examples ^ ^ 

To determine the applicability of these, techniques,- we 
examined the code^pf both the" ground-launched missile 4 and 
the-A-10 flight simulation. In neither c'ase could we find 
significant amounts of* code where back substitution could be 
applied. Further f the qode_contained no DO loops that con- 
stituted linear, recurrence equations, "^or were there signi- 
ficant/ IF blocks that would benefit from^reorganizjit ion. As 
Would be 'eSpected, however, Arithmetic expression evaluation 
pravi.d6d extensive- parallelism. For example*, ■ consider ^the 
expression fer 'the variable QDOT — typical of expressions 
from both programs. 4 



QDOT = IYS * J (.(RHO/2) >* - (WS + WZ) t US. * 4% 839 l£ * Ck&) 
- US 8.86989 * RHO/4 ^* XMQ * QS ' / * \ 
< - 21 .1 * RS * PS - LC * FTfc) 



7Ruck, David J. "Multioperation Machine Computational 
Complexity 11 . Complexity of Sequential an,d Parallel Numerical 
Algorithms , J. PJ Traub, Ed: ^(New^York: Academic Press, 
1973 )' pp. 17-4§/ , 



Figure 8 'shows a standard parse tree for this assign- 
ment. It contains 18 arithmetic operations "and could be 
evaluated in six time steps using, five processors/ This 
would result in" ^n efficiency of 60%% Figure 9* shows the 
modified parse tree after associativity f commutativity , and 
distributivity have reduced the tree heights This tree has 
a height of only^ 5 and consists of 19 operations.' The modi- 
fied tree, could be executed in five tike steps using six 
processors. This also results in an efficiency of 60%. 

During this pha^e of the study,, it became apparent that 
applying these techniques to 'a significant section of code 
was beyond the capabilities of manual techniques'. The num- 
ber of" operations required makes it immensely time consum- 
ing, and the probabilities of error would be so high as to 
make the results' suspect. The only alternative is to con- 
struct programs to automatically analyze, the code segments 
and produce parallel instruction sequences.' This, toa, is a 
significant task beyond the scope of the work." Section 5 
discusses the impact and value of these techniques. 




SECTION 3: PARALLELISM AT THE TASK. I^EVEL 



One of the most common methods ,of producing a parallel 
program is to take a sequential program and "parallelize" 
it* * This involves identifying tasks within the sequential 
program and recognizing that those tasks, together with the 
implied flow of control, represent a task system. When such 
a division is ptJSsible, standard techniques are available to 
produce parallel code. * \ 



Task Systems 

•We define a task asa unit of computational activity 

specified in terms of the input variables it requires, the 

output* variables it generates,^ and its execution time. The^ 

specific transformation that it imposes on' its input to 

produce its output is not part of the specification of a 

task. Thus the tasks may be considered uninterpreted . Let 

j = (j 1 f T 2 , • • -T n ) be ^ set of tasks and an irreflexive 

partial order (precedence relation), defined oil Then C = 

(J,< # ) is called a task System . The. precedence relation 

means that if T <• T' then T must complete execution ' 
•* . 
before T 1 . * , \ 

From this definition we introduce a graphical repre- 
sentation, called a precedence graph , for 'task systems. 
T^is consists of a directed graph whose vertices (nodes) are 
^^Uie ta^cs^j'and which has an edge "frdm T to if T <• T 1 • 
A T' 1 sjich that 'T <• T' ' <• T' dpes not exist. Thus the 
set of (e^ljes* in the precedence gsaph represents the smallest 
relation whose transitive closure is,<-. ^ 

Many sequential programs atnd program^ segments can -be 
• viewed as* precedence graphs. Figure 10 shows an example of 
a program segment and its. related precedence graph. Since 
*~ the* relation <• is irreflexive, antisymmetric and transi- 
'« tive, t^ precedence graph is acyclic — -it represents only 
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C**' 



151 



231 

C 

C*** 



4 

c 

JC*** 



c 

c*** 



TASK 15 COMPUTE THRUST 
NDX=ITHRUST-1 

IF (TIME .LT. THRUSTIME( ITHRUST) ) GO TO 151 
NDX= ITHRUST 

IF (ITHRUST .LT. LTHRUST) ITHRUST*ITHRUST+1 

THRUST=THRUSTAB ( NDX ) +THRUSTSL ( NDX ) 
: * (TIME-THRUSTIME ( NDX ) ) 

TASK 23 COMPUTE LT 

NDX=ILT-1 

IF(TIME .LT. LTIME(ILT)) GO TO 231 
NDX=ILT 

IF (ILT .LT. LLT) IL"T=ILT+1 
LT=LTAB ( NDX ) + ( TIME-LTIME ( NDX ) ) * LTSL ( NDX ) 

TASK 29 COMPUTE CMQS 
NDX=ICMQS-1 
IF(TIME .LT. CMQSTIME( ICMQS) ) GO TO 291 . 
NDX=1CMQS. t 

IF(ICMQ&*.LT. LCMQS) ICMQS=ICMQS+ 1 
. CMQS=CMQSTAB ( NDX ) +(T IME-CMQSTIME ( NDX ) ) *CMQSSL ( NDX ) 



\ 



TASK 1 1 COMPUTE TIME 
CALL RK1 1 (STEP) 

TASK 5 COMPUTE QS 
QDOT=$T(70) 
CALL RK5(STEP)* 



( TASK j 


( TASK \ 


( TASK \ 


— M- 


V 15 j 


*V " J 


-H » > 






Pigure 10 - Program Segment and Related Precedence Graph 
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straight line code (or code that can be viewed as straight 
line). We can deal with data-dependent branches that\fall 
entirely within a task, but not conditional branches to 
other tasks. Further, many loops can tie "unrolled" (viewed 
as straight line code) and handled in an acyclic manner. In 
one instance, discussed later, #e can deal with specific; 
kinds of cyclic graphs. 

With each task T we associate two .events: initiation , 
and termination. An execution sequence of an n-task" system 
C « (Jr<*) is any string <s = a 1f a2f*'°2n of task events 
satisfying the precedence relation and Consisting of exactly 
one' initiation and one termination event for each task. A : 
task system that represents a sequential program has only 
one execution sequence; for other task systems (perhaps 
equivalent to the sequential task system)' there may be 
several. 

v To discuss determinant task systems, we must define an 
ordered set of memory cells M = (M-| , , M2 , • • • ,M m ) that repre- 
sents the physical system on which task systems execute. 
With each task T in a system C we associate two, possibly 
overlapping, ordered subsets of M: the domain D T and the 
range Rfj. When T is initiated it reads the values stored 
in its domain cells; when it terminates it writes values 
into its range cells. Given an execution sequence 5 for- a 
task system, we can define the value sequence V(M^ r 6) as the 
sequence of values written by terminating tasks in P for 
which Mi e Before the first event in any execution 

sequence, we expect the memory cells* to contain values, #e 
refer to that set of values as the initial state P Q . 

We can now define, more rigorously -the intuitivd concept 
of determinant task systems: „ 

A task system C is determinant if for any given 
initial State P Q , V(M if 6) - V(M if « ■ ) , 1 < i < m, 
for all execution sequences 4? and * \ . 
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* From 'this definition, it is clear that a task system 
that represents a sequential program is determinant since 
there is only one execution sequence* ' Given two ,t'ask 
systems both consisting of the same' tasks, they are^ said to 
be equivalent if tftey are determinant and, for the same 
initial state, produce the same- value sequences. . 

Our goal now is to define a method by which, given a . 
determinant task system (i.e. one representing a sequentially 
program) we can derive another determinant task system 
equivalent to the first^which has in some sense ftore paral- 
lelism. In fact, our method will derive one with maximum 
parallelism subject to the constraint that w^have no know- 
ledge of thfe internal transformations performed/ by th£ 
tasks. We begin with the following definition: 

Given a task system C, then tasks T and 
( T' are nox\ interfering if Either f 

T <• T 1 or T 1 < • T 0" 



-or- 



Rrp 0 Rrpl = Rrp fl Drpi = Rrp I fl Drp = 0 * 



" We now state, without formal proof 8 , a fundamental 
Theorem regarding noninter fer ing tasks and determinancy: 

Task systems consisting of mutually' 
noninterfering tasks' are determinant*. 

The final development falls naturally from the Theorem. 
Given a determinant task system C = (J, <•) we construct * 



8 Interyssted readers may consult Operating Systems Theory 
by Etfwarcr G. Coffman, Jr. and P. J. Denning (Englewood - 
Cliffs, NJ: Prentice Hall, 1973) • 
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# * 

another task system C = (J,<*') that is equivalent to C 
but whose precedence relation is constructed from <• on 
the basis that* (T,T' ) e <• ■ .only if it is necessary to 
insure that T and T 1 are non k interf er ing. The resulting- task 
system is, by the Theorem/ determinant. Further, it is 
maximally parallel in that any further reduction of the/ 
precedence* relation results in nondeterminancy . Finally, 
since <•* every execution sequence of C is an 

execution sequence of C and, since C is determinant, every 
execution sequence of C produces the same value sequence. 
Therefore. C' is equivalent to C. This is formally stated *in 
the following Theorem: „ 

From a given determinant task system C - (J,<*) 
^onstruct a new system C' = ( J,< # ' ) where <•' 
is the transitive closure of the relation: 

X = { ( T,T' ) < • | ( Rj 0 R T i ) (j (Rj n D T i ) (RipiH D T ) f 0 } 

Then C 1 is the. upique, maximally' parallel* task system 
* equivalent to C. ,N 

* * * 

Scheduling 



Standard Task Systems * 

♦ * 

Given a determinant task system and th^ execution time 
of each task, the problem remain^ of assigning the tasks to 
p processors. More formally, we define the scheduling 
problem to be the following: we are given 

i 

' t 

(1) a set of tasks J = { r<| , T2,...VT r f. 
(•2) „ an irreflexive partial order <•, oh J, 
* (3) a weighting function W from S to the positive 

integers, representing the execution time of each 
of the tasks, and 
(4) the number, of processors p. 

* r 
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We may be executing as many as p £asks at any point in time. 
If task T is first ^ecuted at time t using processor K, ' 
then it is executed only A times t, t+1,..., ,t+W(T)-1 using 
processor K each time. It is also required, for any task T 1 
such that T 1 <• T, that T 1 complete execution at time t 1 
when t' < t. A schedule is an assignment of tasks to 
processors that satisfies the above cortditions and has 
length tmax, where tmax is the maximum, over all tasks, of 
the times at which the termination events occur. The 
scheduling, problem , then, is to determine an assignment that 
minimizes tmax. This problem is NP-complete 9 and can be 
considered intractable. There are, however, polynomial time 
bound algorithms that produce good schedules. One sych 
algorithm is critical path list scheduling . / 

The algorithm is defined as follows: 

(1) Given a task system and a list that orders the 
% tasks, we require a scheduling strategy that 

assigns (to a free processor) the first unassigneS 
task in the list whose precedence . constraints have 
been met. Such -a strategy is called demand list 
scheduling. • 



The critical time of a task is the execution time 
of that task plus the maximum critical times of 
any successor tasks. * 

(3) If the tasks ate ordered on ; nonincreasing critical 
time, then the resulting list schedule Is ^called 
critical path list scheduling. 



9j. D. Ullman. "Polynomial Complete Scheduling Problems' 

Operating Systems Review , Vol. 7 No'.j4 ( 1973.), pp.. 96-101. 
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Kohler 10 reports a preliminary evaluation in which 20 
task systems, scheduled using critical path -list scheduling,.^ 

• produced 17 optimal- schedules. The worst-cas% schedule was 
only 3.4% longer than optimal. Using only limited back- 
tracking with a critical path list scheduler, Lord 11 found . 
that in 100 randomly generated cases, 89 were scheduled 
optimally. • He further fourtk-that for all cases the 
schedules had an expected time of only .36% ionger, than 
optimal. The worst-case time was 5. 6% 3 longer. Thus we 

' conclude-that critical path list scheduling is an acceptable 
technique for practical application. 



Cyclic Task Systems 

As we have observed before, the standard task system 
represents an acyclic computational method. This method 
applies to repetitious calculations such as flight simula- 
tion problems by treating the calculation of derivatives and 
the updating of the state variable as a task system, 
scheduling those tasks, and then repeatedly executing that * 
schedule. In some, cases, however, shorter solution times 
can result if we represent'^'ttTe-^problem as a cyclic task 
system. For example, consider the Van der Pol equation ' 
written as two first-order equations: 



. xi^ * 2 

1 - Y 

1 



x 2 = u( 1 - x J)x 2 - x 1 



10 W. H. Kohler. "Preliminary Evaluation of The Critical 
Path Method for Scheduling Tasks on a Multiprocessor 
System". IEEE Transactions on Computing , Vol. C24, No. 12 
(December 1975), pp.. 1 235-1236. 

E. Lord. Scheduling Recurrence , Equations for S olution 
on MIMD Type Computers' . PhD Dissertation, Washington State. 
•University, 1976. ... . 
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-By using .some suitable integration method (for example, 
4th order Runge-Kutt^ as indicated by the function rk) , the 
main part of a program for solving these equations is as 
follows: _ 



while time < runtime do 

05 



for i < 1 until 4 do 



der, < x 2 

der 2 < u*(1-x 1 , *x 1 )*x 2 -x 1 

xt < rk(der-| >i, 1) 

x 2 < rk(der 2 ,i,2) 



time < time + h 



S * 

The calculation interior to Ihe "for" loop can be 4 
represented by. the acyclic precedence graphshown in Figure^ 
1 1 . Assuming that each binary operation can be executed in 
ooe time unit and that the function rk can -be evaluated in 
four units, the entire "for", loop can be. represented by the 
cyclic precedence graph shown in Figure 12. T3 calculates 
uMI-x^xt), T4 calculates *x 2 -x 1r and T1 and T2 calculate v 
the function rk. 



(•Given two parallel processors, one way to schedule this 
solution is to assign the tasks interior to the "for" loop 
to processors. This should be done in such a way as to pre-, 
serve the precedence relations and yet complete all tasks as 
quickly as possible. The solution to the problem is the 
repeated execution^of this schedule. Such an assignment is 
shown by the Ga#tt chart in Figure tl/ Note - that -this 
assignment is'a-s good as possible — the precedence, graph 
has a. maximum dath lengt'h equal to the assignment period. 



9 
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- - The Gantt chart in' Figure 12 shows the assignments made 
if we assume some initial values for X-j- and X 2 and then 
Assign *the tasks from the cyclic precedence graph while 
maintaining all prepedence constraints. This assignment has 
'a repetition period of seven units, as compared with nine 
units for assigning the acyclrc precedence graph. This 
shorter schedule is the motivation for examining flight 

Vsimula,tion equations to determine their minimum solution 
period and to schedule' them in that minimum period with as 

• few, processors as possible. 

The method used constructs^ task system representing 
the solution to the flight simulation equations, where the. 
tasks that update the state variables are flagged. The 
precedence graph of^the task system is allowed to be 'cyclic 
. so long as each cycle traverses at least one flagged task. 
The minimum solution period is then determined by examining 
all cy6les in "the graph.- ^ ' ] f 

Let the cycles be denoted by C-| , C 2 ,...,C m . Eor each 
cycle let L(Ci) denote .its length and KC^'the number of 
flagged tasks in the cycle.^ Then^the minimum solution 
period t m i n is: 

C^cnin = Max { |L(C i )/#(C i ) | .| 1 < i < m> - 

Once t)ie minimum solution period is determined,, a critical 
path list scheduler can, with only slight. modifications , — „ 
produce an efficient schedule whose repeated execution 
solves the flight simulation problem. ' ' 



Synchronization 



Once a schedule has been determined, there must be some 
way to insure thaf the schedule is followed. A general 
assumption made regarding MIMD computing is that the precise 
execution rate of individual processors cannot be used to. 
prove the correctness of a program. This assumption applies 



v 



to HEP; although we know that execution rates of processes 
are generally the same, detailed knowledge of the progress 
of each* process* is beyond the scope of normal analysis of 
programs. Thus, having determined a schedule -for computing 
the tasks, it now remains to implement it. 

* »■ 

Much of the work on Scheduling- assures, at feast 
implicitly, that someT mechanism external to the processors' 
assigns the tasks to the processors. But our ^xec^tion 
times are estimates onTy, so the scheduling mechanism would 
have to monitor the progress of all processors •> Instead we 
seek a mechanism whereby all the tasks to be executed by a 
single processor are presented as a .sequential^ program, • 
Synchronization primitives/ operating on* semaphores, coordi- 
nate those tasks. ~ 

Dijkstra 12 introduced the primitives P and V/ which * 
operate uninterruptably on- an event variable* termed a 
sem^phor^e, to control resource allocation among concurrent 
processes. For our purposes *we may define P and V as: 



P (E): if.EJ , V (E): 

>hen EH3-1 ] E+E+1 - ' 

el3)e wait 




P is' normally used before a process uses a nonsharable 
resource; V is executed when the use of the resource is. 
, completed. - - r 

Denning 13 fehows that these primitives can 'synchronize 
concurrent tasks. As an example, consider the task system 



12 E. Jf. Dijkstra.- "Cooperating Sequential Proesses" . 
^Programming Languages , F. Genuys, Ed. (New York: * Academic 
Press, 1968) pp. 43-112.. 

13p # ^j. Denning. "Third Ge^ration Computer 4 Systems". 
, Computing Surveys , ,frol . . 3 No. 4 (1971) pp. 175-216. 
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and concurrent program shown in Figure 13. The program uses 
P and V operating on the suitably initialized semaphore X23. 
-Clearly, the program correctl'y executes the task system. 
Since we are using task systems to represent computations, 
■precedence constraints arise because one'task computes data 
elements .used by another task. If we were to consider a- 
task system that represents, a calculation , loop, such as 
shown in Figure 14, we find that the first program s,till ' 
represents a valid solution tp > the^ problem. This is because 
it is implied that both streamer and' stream 2 complete ■ 
execution before beginning the second execution of these 
streams. 

Such methods of computation have been previously 
proposed for handling looped and conditional execution using 
constructs named "fork" and "join".„ But if the alternate 
program executes the task system, then the P and V • * 
'operations , are no longer valid. This^ is so because if S2 
runs more qui-ckly than SI, at some point T2 completes the' 
calculation of the "data element that causes the precedence 

"'constraint before T3 has consumed the previous value. Even 
if- we assume a queye for this data element, in any real ' ( 

, implementation thf queue would be of finite size and henc/T 
subject to overflow. To overcome this difficulty, we use 

.two state semaphores associated with each data element or 
variable, as indicated by: ^ - • 

— 1 ' VAR • ■ , 

2 VALUE 

' 2 ; SEMAPHORE [ •' E ' , ' F ' ] - 

where 'E ' indicates "empty and 1 F 1 . indicates full. We now . 
define the P and V operations as: • 

P(VAR): 

IF VAR. SEMAPHORE = 1 F 1 
THEN VAR. SEMAPHORE <- 'E' 
ELSE WAIT . . 




PARBEGIN 



SI: Tl; P(X23)j^T3j 
S2 : T2 : V(X23)j T5 



PAREND 



\ > 



Pigure \3f - Td?k System and Concurrent, Program 
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I- REPEAT N TIMES; 
PMBE6IN 



SI: Tlj P(X23); T3j T1 
' S2: T2; V(X23)j T5 
PAREND 



\ 



AtTERNATE ' - 
- PARBEGIN. 



SI: , REPEAT N TIMES " 

Tlj P(X23); T3; T4 

END - 

- S2: REPEAT N TIMES • . 

• \ T2; V(X23); T5 
END 

PAREND 



Figure 14 - Task System for Repeated Execution 

J • - 51 - 

56 ' 



V(VAR)': 

j. IF VAR. SEMAPHORE = *'E' 

THEN VAR. SEMAPHORE <-*?*' 
\ ELSE^WAltf » » 

Then i*f we let X23 represent ' tfce variable responsible for 
the precedence constraints from T2 to T3<, the alternate pro- 
gram correctly executes the task system. 

To further simplify the programming aspects of such a 
synchronizing method, -we note that, -in a Language involving 
assignment statements, N context determines whether the opera- 
tion is^ P or V. That'is, any synchronizing operation on the 
left of the assignment symbol denotes. a V operation. All 
others -denote a P operation. * IrfHEP FORTRAN, $ represents 
both P and y; 'context denotes which operation is, implied. 
*If some task T1. computes a value used by two other\task's, T2 
and T3 ("each in separate instruction streams), then the 
coordinatiqfh^prpblem between T1 and T2 is separate from the / 
.coordination problem* between T?1 and T3. Hence, two copies 
of the variable are required so' that two separate semaphores " 
are available. 

/ • ■ • • • 



Automated Techniques 



' t During thfc course of thi? study, we developed and 4 used 
programs to automate many of the steps involved in preparing 
a problem for parallel solution. We believe that Sufficient 
knowledge is available to construct a CSSL-type language 4 ( 
translator 14 that would produce efficient parallel code for 
the types. of .problems we "havfe studied.. But the class of, 
problems so far studied \s relatively small; desirable 



14 See "Continuous System Simulation Language" by J. C. 
Strauss, et. al. Simulation , Vol. 6 No. 12, December , 1967 . 
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extensions to such a language may be poorly understood^ 
Thus we feel that a practical approach for the immediate 
future is to use a set of utility programs that will, signif- 
icantly aid programmers in constructing parallel programs 
yet not impede them in the methodologies they use. In the 
remainder of this section we discuss the various utility 
programs whiclr we feel would be useful and the' sourGe lang- 
uage restrictions they would impose op the programmer. * 

We assume that the definition of a flight simulation 
problem will be extended to a sequential FORTRAN program 
that defines the derivatives' in terms of the state variable^ 
and updates the state variables by whatever technique is 
desired. In general, we assuTne that updating each state ' 
variable is a separate program segment, so they can be 
designated individual tasks where desirable. Since each 
program segment mi}$t be simple enough to be represented by a 
cyclic task System, restrictions on conditional branches, 
will * be Required . Selecting code segment^ fpr'tasks'is not 
unique — we can give only guidelines as to what constitutes 
a good selection. Thus" we require the programmer to specify 
which pieces of code constitute tasks. No branches can 
occur from one task to another. In practice we have found 
that this restriction is not at all severe. 

There aire a variety of methods the programmer could use 
to indicate what constitutes a task. We have used a comment 
card of the .form 

<* 

C***TASK n [,SV] 

to indicate that the following statements constitute task n; 
the option SV indicates that the task updates a* state or 
recurrence variable*. This directive is terminated by 
another task* comment card or by an END card. .Since the 
final maximally parallel task system equivalent to this 
sequeati^J. program is derived from the ranges and domains of 
the tasks, and since range and domain determination is not 
always possible in FORTRAN, further extensions of the 
comment cards are required. Specifically, if the statement 
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, CALL SUBR (A,B,C,) 

is within a /task , there is no way to determine if A, B, and 
C are in the range of this task, the domain of the task, or 
both. Also, it may not be worth the" effort to analyze all 1 
of the equivalence statements. Thus we use comment cards of 
the form s * ' 

C *** RANQE (list) . . 



and 



C *** DOMAIN (list) 



to indicate that all* variables within the list are in the 
range or, domain* of the current task. 

Another piece of information required by automated 
analysis is an estimate of the execution time of each task. 
We have chosen the units o,f this measure to be the number of 
instructions executed within the task* If the code con- 
stituting the task is straight line code, the- number of 
^ instructions is known at compilation time. But if the task* 
contains conditional branches or invokes external subpro- 
grams, the execution time of the task is not usually »deter- 
mina≤ the programmer must supply an estimate. To this 
end we use a comment card of the* form 

C *** TIME n 

to specify the execution time as n machine instructions. 
Note that specifications of range, domain and time are 
required only if the form of the code precludes the utility 
-from determining the values. 

A' final comment card reduces analysis time by listing - 
*local variables that are to be excluded from range-domain 
analysis. This card has the form , * 

C *** LpCAL list 

and indicates that, for this task, variables in the list are 
to be excluded from^ both the range and domain of the task. 

■ 
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This is used mostly for variables that are first in the 
range and* then in' the domain, as would be the case for DO 
loop control* variables. ' 

A PASCAL^program was constructed for this strudy to 
determine range 'and domain.- It soon became evident, how- 
ever, that this program must perform lexical and syntactic . 
analysis of FORTRAN source code just as the FORTRAN compiler 
must do. Therefore/ compiler output was used to determine 
the- task system. This requires, the following: 

A source code image. This allows the extent of 
the tasks to be determined, and comment cards 
pertaining to range, domain 'and time to be 
examined. J v 

(2) An image of the generated machine code. This 
allows executxpn time to be estimated for 
those tasks that consist of 'only straight 
line code. 7 * y 

« 

(3) A cross- reference listing. This allows ranges 
and domains -.of the tasks to be determined. 



If suitable compile options are invoked, all this informa- 
tion is in the compiler oiftput 'listing. . The output of this 
program is the cyclic task system required for a scheduler, 
and the names of variables involved in*intertask communica- 
tion. An additional output is a file of the source programs --\ 
whieh is used to construct .the parallel prograih. y 

The secpnd utility program is the scheduler.. The 
inputs are a cyclic task system, an estimate of the execu- 
tion time, and a specification o% the number of processes. , 
The output JLs. a schedule that is not necessarily optimal but • 
has good efficiency. In test runs, the schedules produced, 
for . 100 randomly generated cyclic^) task systems resulted in 
93 schedules that were optimal. The expected schedule 
length was no more than .158% longer than optimal. 

» 
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-The third utility program uses the output of the 
previous two programs to determine the synchronization 
required to insure that the .schedule Is not violated.,^ The 
basic algorithm examines, for each pair of task's to be exe- 
cuted, in different processes," which variables are in the 
range of one and the domain of the other. For those varia- 
bles, asynchronous (semaphored) variables must be used. 

For each variable it must\be the case that e Rrp 
for some task T. Th& possibi lrties\ f or each variable .are: 



(a) For all tasks T' such that e D T , , T 1 has been- 
assigned to the same processes as T, in which case j\o 
synchronization is required. 

■6 

(b) There is only one task T 1 with e D T i and it has 
been assigned to another processor. Further, if the 
variable name has only one instance in both T and T 1 , 
then 'the variable name is. prefixed with a. $ in both T 
and T' afrd the asynchronous variable is placed in 
COMMON. . • . > - 

(a) There are Multiple tasks T 1 such that; e D T i and 
some of these tasks have be # en assigned to different 
processors than*T. In this case, for each T 1 which has 
been assigned to a different processor we associate a 
new asynchronous variable ^$W. This variable is placed 
in COMMON. The assignment statement $W = V± is placed 
♦at the end of, the code for T, and the assignment 
statement V± = $W is placed at the beginning of the 
code for T' . 

■» * 

Another utility pr6gram that would be useful is one 
that actually constructs the code sequences for the various 
processes based upon the preceding analysis. This would be 
particularly usefjuT in a system witji a flexible text editor. 
This would allow the programmer to add output statements or 
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exception-handling code. This utility has rib firm technical 
requirements; it is more a convenience feature for the 
programmer, 

* * . * 

Results 

* - • 

* 

The methods of fcfie previous section were applied to the 
flight simultibn of a ground-launched missile^ ;The methodo- 
logy employed in programming* the flight simulation equations 
for an MIMD computer can be divided into several categories. 
These include equation segmentation, scheduling and synch- 
ronization. ^ 

Equation segementation takes a representation of the 
problem, in our case a sequential FORTRAN Program, and 
identifies- the tasks. These tasks are considered to ke 
individual, computational activities, and could range from 
individual machine instructions to groups of FORTRAN state^- 
ments. We chose individual statements or small groups of 
statements, wher^e any branching took pl^ce entirely within 
the group of statements identiliied as a task. An example of/ 
this, task selection is shown in Figure 15, which shows a 
portion* of the sequential code and indications of some 
specific tasks. In thia case a tot&i* of ^40 tasks wer£ \ 
identified. Ten of their) , update the state variables by the 
chosen integration method, and 'one updates the independent 
variable, time. The remaining 29 tasks are associated with 
evaluating the derivatives. '** x ~ - ~— " - • 

The next step was to estimate the execution time of 
each task.' Since the HEP computer executes all instructions* 
in the same time, this involved -compiling the program and 
"counting the number of machine instructions generated by, 
each task. The number of instructions jper ,task rangdd from- 
2 to '88, with an average of 34. 6, We next determined a, 
maximally parallel task system equivalent to the set^ of 
tasks 'selected, and the sequential program for thosSe tasks. 



0127 0 

0128 C*** TASK 17 COMPUTE ACDQ 

0129 NOXsIACDO^l 

0130 iFvlTIHE »LT» ACDOTMi lACDO ) ) GO-TO ^171 

0131 NDXsIACUO 

Q132 IF( IACDO .U . LACDD) I ACOo~IACDO+l 

.0133 171 AC00sAC0OTB(NDX)+(TIME-ACD0TM(NDX) ) 
013<+ ; ♦ACOOSLtMDX) 

0135 C • ■ » * 

0136 C*** TASK 18 COMPUTE UDOT 

0*137 UU0T=RS«VS~WS*QS-32»17*STHETA+MASS* 
0136 : ( THRUST-WHQ/2* ( US+WX ) **<US<mX ) ♦ACDQ ) 



0139 C*** TASK 19 CO'^PUTE FTY . FjZ 

01^0 GAnTHE=(THETA-THETA2)*COSPHl*(PSI-PSIZ)*SINPHI 

Oim GAMPS? = (PSI-PSIZ)*P0SPHI-(THE*TA-THETA2 , *SINPHI^ 

01*42 . - FY = S / 4'4l*GAMP'?Jt 

01*3 . IF (ABS<FY).LE.380)' GO TO 35 

01*** FYxSIGN(380.tFY) 

01*45 35 F2=8Hm*GAMTHE 

01^ IF <ABS<FZ).LE.380) GO TO 36 

01*47 FZ=SIGN(38o.*F2> 

01*48 36 CONTINUE . 

0H9 t>TY=FY*C0SpHI+F2*SlNPHI ^ 

Q150 FT2=F2*C0SpHI-FY»SirjPHI * 

0151 C*** TASK 20 COMPUTE ACNAPH " " 

0152 IF (MACH .LT. ACNMH(IACN) ) GO TO 201 

0153 NDXrIACN 

015H ZFCIACN .LT. LACN) IACN=IACN+1 

0155 GO TO 203 

-0156 20i IF (UACH .GE, ACNMH < lACN-1 >> GO TO 202 

0157 * IF. CIACN .GT» 21 I ACN?lACN-l 

0156 20? NDX = IACN-1 * ' * " ' • * 

0159 -2O3 ACfJAPH=ACNTAB (fiDX") + ( MACH-ACNMH (NUX ) > 

0160 : *ACNSL(NOX) 
0161 



0162. C*** TASK 21 COMPUTE VOOT 

0163 VD0T=MASS*(FTY-RH0/2*US*ACNAPH*(VS-WY) ) -RS*US 

01b«* C " ' 

0165 C*** TASK 22 COMPUTE WDOT § 

C166 WU0T=QS*US+32.17*CTHETA+MASS*(RH0/(-2)*US«ACNAPH^ 

0167 • :*WS4W2)-FTZ> 

01^8 C*** TASK 23 COMPUTE LT 

0169 N0X=ILT-1 

0170 IF ( TIME #t-T. LTIME(ILT)) GO TO 231. / 

0171 ' * NOX=ILT ( 4 

0172 IF (ILT .LT. LLT) ILT=:ItT+l \^ . 

0173 23 t LT=LTAB(NDX) + (TIME-LTIME(N0X) ) *LTSL f NDX ) 
017*4 C 



Pigure 15 - Task Selection 
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16" shows £he task system' for the 40 t^sks com- 

c ^ problem^ solution. The task number ' and 'execution 

time (in machine instructions) are within the nodes. All 
ar$s go left to right. Observe that the three ;tasks 
highlighted in Figure 1*5 (tasks 18^-19 and 20) can all be' 4 
executed in parallel. . 

- * 

Before scheduling, we make, a transformation on the 
parallel task system to shorten the solution time. In 
Figure 16 we see that the longest path traverses" nodes 7', 
39, 19 and 31, and has a length of 212 units. This^path 
does not "determine minimum execution time, however, because 
there is no path from node 31 to node 7. The minimum time 
is instead .determined by. the cycle traversing 7, 39, 1$, 3,2, 
6 and 3, which has a length of 252. This yields a minimum 
execution time (for n iterations) of n * 126 + constant. 



The-next steps in our methodology were scheduling the 
transformed task system for execution on-p processors and 
synchronizing* Figure 17 shows the resulting schedule. 

The schedules for the flight simulation problem were 
^rogr.ammed-fo'r HEP FORTRAN and were executed^on the HEP. 
Equation segmentation, in conjunction with the^fojjrth-order 
Runge-Kutta formula given by ( RK4-) was used for the 
^eight-processor schedule shown in Figure 11. The computa- ^ 
.tions of ttfe integration formula were also done as parallel ; 
tasks. This scheme was also programmed "using six proces- 
sors; the speed-up- was 3.98". The speed-up and efficiency of 
the e_isht processor program, along with the computational 
results, are shown in Table 4. * 



PROGRAM P * T 1 T p S p " E p 



RK 



-2MB '4.87 5.78 .72.3% > 



Table 4. Speed-up & Efficiency, Eight Processors 
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Figure 1 6 - -Maximally Parallel Task System 
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Figure 17 - A Schedule Using Eight Processors 
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/ Subsequent 1 analysis has shown that the speed-up shown ' 
in Table 4 can be increased to 7.0< This is accomplished by 
reducing the . amount 'of -synchronization . ^he following 
'analysis -indicates* the efficiency losses in the solution. ,. 
Let : / - 

A = number of cycles required by actual 

computation, ' m 

B = number of cycles required by the best 
^ ^ ' schedule, , 

c = 'number of cycles required by synchronization/ 

For the eight-processor scheme with'RK method, the value? of 
A, B and 0 are: 

* ' . • . • , 

< . . A = 1384/8 =-173 cycles • 

- B = 192 cycles = 10.9% of A 
• J - C ='(.78 '+ 2)/8 = 19^5. average . ' J 

•and C = 23 for wo*rst case = 11. 9%. of B ♦ 
\ ' - ' ' * ' 

* - f 

The total number" of . cycles is -then given by, 

'* » Cycles" =- A + (B,- S A),+ G 

/ 4 = 1 73 • + 1 9" + , 2> " • 

— . '= 215 , \ . \. 

The predicted .solution time, is" given by 

PST' = Cycles x 28,00.0 x .8 x 1 0~ 6 = 4.816 seconds 

* • • * 

Cditipare^ this to th^_-ajjCoral " solution time, g'iven by Tabie 4, 
of 4 .87 -seconds ^ • • 



Although 'execution of the A-10 code was impossible, as. 
discussed in 'Section 1, some analysis was' ^Aade of the four • 
subroutines ZM6SF101, ZM6SP1.02 , . ZH6AF103 and ZM6S.P10E . 
These subroutines were treated as. a single code segment and 
Were divided Irfto 50 tasks. -The total execution time of the 
tasks was 2803 machine instruc€ions, or 2 . 2,42' mi-lli seconds . 



on HEP , using a single processing 'stream. Since the execu- 
tion speed on HEP considerably exceeds the requirement 15 , 
large amounts^f^paralielism did not seem required. We did, 
however, schedule the task system for maximum speedup, 
which for this system resulted * in a solution time of 671 
machine instructions or .537 milliseconds using five 
processors. The efficiency of this solution is 83.5%, but 
does hot' include synchronization requirements.* 



• • * . 4 • ' 

% 15 33.3 milliseconds (30 tim^s per second), as specified by 

« * • , H & 

comment lines , in the subroutine ZM6SPT01. 
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' ° . SECTION 4. REORGANIZATION 



f 



* 



Thi$ section deals with' problem-solving^ by alternate 
methods that are either inherently parallel -or* lend them- 
selves to parallelizat ion. Specif ically, we cover the pro- 
blem of solving ordinary differential equations and exatnine* 
J^wo exajnples in the general area of mathematical library 4 
functions . ♦ • 



Parallel Techniques % for Ordinary Differential Equations 

' / ■ T 

General Methods 

This section deals with parallel' methods for ^solving a 
set of rr ODEs denoted* by S\ 



where 



. y'(t) - f(t, y(t)) , y(i 0 ) = ~Yo ' < 1 > 



< tg r t e Yq e R n f y : R -> R n , J£ : R x R n -> R n 



Most methods that solve (1) generate approximations y n 
% ; to y(t n ) on a mesh a.= tg < t<j < t 2 <...<Jb N .= b. t These v 
a*re called step-by-step- difference methods. An r-step dif- 
ference method is one that CQmputes y n +j Using r earlier 
values yh'^Yn-1' Yn-r v +1»- This numerical integration of 

v , ( 1 ) .by fini.te differences is a sequential calqulation. 
J Lately, several authors have addressed the question of using 
* -srqme. of 'these .for mul*as. simultaneously 'on a set of arithmetic 
* processors to increase the iategr action speed. " * 1 



Interpolation Method 

Nievergelt 16 proposed a parallel form of a serial 
integration method to solve a differential equation, in ; 
which the algori-thm is divided into subtasks that can be 
•computed independently. The method is'-as follows: 

(1) Divide the integration interval [a,b]^ into N equal 
subintervals [tj_i» t^] , t Q = a, t^ = b, ^ 

i = *1 ,, 2 i 3 i " • • . t 

(2) Make a rough prediction J{ of the solution y(t^) 

(3) Select a certain number M'i of values Y i j , j # - 1? 
... , Vl i in Ave vicinity of yj 

(4,) /integrate' simultaneously & ( with an. accurate 
k integration method M) all the system 

•4 

y«' = f(t, y), y(t 0 ) = y 0 , t 0 < t < v 
y - = f(t, *yW y(ti) = Yij , t A < t < t i+1 

j = 1 , . . ,M if i=1 , . . fN-1 

The integration interval [a,b] will be covered with 
lines of length (b=a)/N, which are solutions of <1) but do 
not join at their ends. These branches are connectedly 

interpolating, at t, , .t 2 Vl' the previously 

found solution over the next interval, to the right. The 
time of this computation can be represented by 

t = 1/N (time for -serial integration) 

P 1 0 
^ + time" to predict 

+ interpolation t<ime + bookkeeping ■ time 



16j Nievergelt* "/at a 1 lei 'Methods for Integrating 
Ordinary' Differenii/l Equations". Journal of .Computer and 
System Sciences, 1973, pp. 189-198. % 
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Interpolation can be.done in parallel. If we assume 
thaluthe time-consuming part is realty the evaluation of 

. f(t, y), the other contributions' to the total time of compu- 
tation become negligible. The speed-up is roughly 1/n. But 
to compare this method with serial integration from a to b 
using method M, the error introduced by interpolation is 
important. This error depends on the problem, not on the 
method. For linear problems the error is .proved to be 
bounded., but for^ nonlinear problems it may not be. Thus 'the 
usefulness of' this method J.s restricted to a specific class 
of problems, and depends on the choice of many. parameters 

^Like Yl, M i# and. the method M. • '' 



Runge-Kutta (RK) Methods 



The general form of an r-step RK method, the integra- 
tion step leading from Y n to Y n+| , consists of computing 

K i - H n f <v y n > ' 

Ki = h n f(t n + v ai h n , y n +.b i:j Kj) . 

Yn+1 = Yn + R i K i" '<■''' 

.V ... * < 1 

with appropriate values of a's, b's, and R's. A classical 
four-step serial RK. method is I \ 

K 1 = h n f (t n ' ; yn) ' * ' 
K 2 = hf(t n *+ h/2, y^ + (1/2) K1 ) ■ . 

' K 3 " hf < fc n '+*h/2, v y n . + (1/2)K 2 ) (RK 4) 
K 4 = hf(t n + h, y n + K 3 j 

; * y n +1 = y n + + 2K 2 + 2K 3 + K 4 ) ' * 
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• fc ~- 'Mirajiker and .Liniger 17 considered Runge-Kutta formulas 

- "that can t?e~ used in a parallel mode. They introduced the • 
^ concept of> computational front for allowing parallelism. 

* Their parallel second and third order RK formulas are 
derived by a modification of Kopal's results 18 . The 
parallel' schemes have the structure: . 



1 

first order: = h n f(t n , y n ) 

yl = y 1 + K-| ; 

second order:. K?"= K = h f(t , yh 
« 1 1 _n n n 



(RK1) 



1 <2 

K ? * h n f(t n + ah n , y n + bK,) (RK2) 

. y 2 =. R 2 K 2 + R 2 k '' 
n+1 11 -2 2 



third order: K 3 = Ki 

K 3 = K + 
2 2 * 



K 3 = h n -f(t n + ah n , yl +-bKi + CK2) (RK3) 

, y3 * = R 3 k 3 >+ R 3 K 3 + R 3 K$. 
n+1 ^11 ' 2 2 3 



17 N . L . Mi ranker and W. M. Liniger. "Parallel Methods for 
Numerical Integration of Ordinary Differential Equations/. 
. Mathematical Computation , Vol. 21 (1967ft pp. 303-320. x 

1 V 8 Z . Ropal. "Numerical Analysis with-Emphasi-s on The 
. Application of Numerical Techniques to Problems of 
Infinitestimal Calculus in Single Variable". Wiley, New 
York: Chapman & Hall 'London, 1955 -M R 17, 1007.. 
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The parallel character' of these formulas is based on 
the fact that RK i is independent of RK j if and only if 
i < j / i / j % 1 / 2, 3. This implies that if RK1 runs one 
step ahead of RK2 and RK2 runs one step ahead of RK3, then 
(using Kopal's values of R) the parallel third order RK 
formula is given by:" 

K . ■ hf(t n+9 , v 1 ) 
' ?,n + 2 n > +2 ' y n +2 ; 

/•■if' /' 

y n« = y „ + 2 +K ,,„ + 2 .. (SRK3 » 

(* , = h£(t + ah, y 1 + aK ) 
2,n+1 n+1 n+1 1,n+1 - 

* 

? 2 '% m y 2 ^ + d-1/2a)K + (l/2a)K 

n+2 n+1 1,n+1 2, n+1 



6 



K 0 = hf(t n t a lh, y 2 + (a 1 - 1/6a)K + (1/Ba) K ) 
3 ' n P ' 1,n 2,n 

Y 3 ^, = Y 3 + [(2a! - 1 ) /2a] ( K - K ) + K 
n +1 n 1,n 2,n 3,n 

where " . 

a = 2(1-3a 2 )/[3(1-2 ai )] . • 

* 

One valfte-of "a" suggested by Kopal is 1. This gives 
ai = 1/2 + 1/2/3. The above third-order RK formula 
requires three processors to* compute the three function 
evaluations in parallels 

» « 
The -main drawback of (PRK3) is that it is weakly 
stable. Miranker and Liniger (1967) show *hat the scheme • 
leads to an error that grows, linearly wi£h n as n -> w an<3 h 
-? 0 for rt n = nh - constant, ^his problem is due tt> the 
bas^c nature of the one-step formulas with respect to their 
y-entries, which are the only ones thait contribute to the 4 v 
"discussion of stability for h -> 0. 
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Predictor-Corrector (PC) Methods , 

The serial one-step methods 5f the 7 Runge-Kutta type are 
conceptually simple, easy to code; seVf -star ting and numer- 
ically stable for a large . class ' of problems-. -On the other 
hand, they are inefficient; because J>& their one-step 
nature, they do not make full use of .the available informa- 
tion, and their numerical stability does not extend to their 
parallel mode. It seems plausible/ that more accuracy can be 
obtained if^fcU value of y n+1 is made to. depend not only on 

y n but also, say, on y n -V ?n-2' , • •** ^V 3 f n-r1 ' f n-2' 

For this reascmr-flmttrrs^ep methods have become very popular. 
For high accuracy they usually -require less work than 
one-step methods. Thus,, the desire to obtain parallel 
schemes for such methods is reasonable: 

* 

A standard, fourth-order, Adams-Moulton serial 
predictor Corrector (SPC) is; ^ , />v 

■ / 

/ 

V = yc + h/24(55f<r -59f.c + 37f<= - 9fC ) - .(SPC) 
1+1 i ^ l\ - . l-l , .1-^ * 

' ; y o +i . v c t h/24(9/p +1 ♦ 19£« - ?£«_, ♦ fl_ 2 ) J . 



9 ' 
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The computation/scheme (called PECE) of one PC step to 



'calculate y^ + i is 



1. 'Use the predictor equation to calculate an initial f 
approximation to yi+1- Set i = 0. * 

/ " " . JP ■ • \ • 

2. Evaluate thd derivative function 



Use the cc/rrector equation to calculate a better 
approximation to y\+-|» x « 



4< Evaluate/ the derivative function f? +1 



/ 
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5. Check, the termination rule. If it* is tfot time to stop, 
increment i,, set .y i+1 = y c r and return to 1 . 



/Let = total time taken- by function evaluation dyne for 
one step of PC. 

T PCE = time taken to compute predictor-corrector 
equation for. a single equation. 

• > 

Then the time taken by one step of SPC 'is 

. ■ v 

Ti = 2(nT + ) . 
1 PCE "V 

*' Miranker and Lj.nigerj ( 1967) developed formulas* f^r the 
PC method in which the corrector does* not depend serially on 
the predictor, anA the corrector calculations can be per- 
formed simultaneously. This Parallel Predictor-Corrector 
<PPC) operates in a PECE mode, and the calculation advances 
s >^teps at a time. There are 2s processors and each proces- 
soryperf arms either a predictor or a corrector calculation. 
This scheme, is shown in Figure 18. %t A fourth order PPC is 
given by:> jfc 

^? ^y? + h/3(8fP - 5f£ + 4f c - - f c ) (PPC4) 
V i+1 i-1 i * i-1 * i-2 i-3 • 



( y c y c + h/24(9fP + 1 9f°* : - 5f c + f c ) 
i i-1 / I 1-1 i-2 , i-3 



Thus the parallel time fQr a single step of (PPC4) is given 
by * ; ; 



P PPC ■ nT PCE + T f + . 3nT DC + 2T < 



1 *"V . 
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Predictor 



Corrector 



Compute 
Derivatives , 


"K 

Update 

State > 

Variables 




/ 


Compute 
Derivatives 


Update 

State 

Variables 



i-1 



Figure 18 - Parallel PC Scheme 



/ 




t • , = T*'as defined before— £fhd 
i PCE t- ' 

T DC time taken for data co ]^ uni 



cation 



T • = time taken for synchronization. 

• S 



ho 



1 



h ^Generally, the higheV accuracy and fewer function -evalua-^ 

tions of PC methods (as compared to RK method s)^are obtained 
at the cost of increased complexity and /^sometimes, numeri- 
• cal instability. -The parallel RK metho&Lg*ven by Miranker 
and Liniger (1967) do not inherit the stability of their 



- 71 - 



9 

ERJC 



7-8 



( 



serial counterparty* On the other hand, P9C methods in 
Mir anker and LiMger, as described above, are as stable as 
their serial formulas. This is proved by Katz et.al. 19 . 



Block-Implicit Methods 

' ' > ' # ♦ 

Sequential block implicit methods as described by 
Andria et. al. 2 ^ and Shampine and Watts 21 produce more 
than one approximation of y at each step of integration. 
Shampine and Watts and RQsser 22 -discuss block implicit 
methods for RK and PC type schemes. A -two-point, fourth- 
order PC given by Shampine and Watts is: 



r 



19 N. Katz, M. *A. Franklin and A. Sen. "Optimally Stable 

Parallel Predictors for Adams-Moulton Correctors". Computing 

* and Mathematics with Applications ,, Vol. 3, ( 1977), pp. 
217-233. - 



20f. D. Andria,, G.' D. Byrne and d! R. Hill. "Natural, 
Spline .Block Implicit Methods" . BIT, Vol. f3 ( 1973), pa. 
131-1*44 , 



♦ 



7^0 



21 L. F. Shampine and. H . A. Watts. "Block Implicit One Step,^/ 
Method^". JfethematicM Computation , Vol. 23 ( 1 964 ) pp." 731- 

22 JN R^ser. "A Runge-Kutta for All Seasons". SIAM' Review s 
Vol. 9: {Th$L&. 1967), *pp. 417-452. 7 
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1 

• ' vP - 1/3(y c; + y c + y c ) + h/12(29f c -72f c +79f c .) 
y i,+2 1-2 . i-1 i i-2 - i~1 1 i 

* » * * 

* " t 

y e -6 y c + h/12(5f? + 8f? , - tfj (BPC) 
- /i+1 i ' i. i+1 i+2 ' . 



. * 

Wor-Iand 23 describes the natural way W parallelize 
(BPC) using the number of procesors"= number of* block points 
by the schemes shown in Figure 19'. The parallel timeyfor 
one Block calculation given by Franklin 24 cis: , 

T BPC = (2nT PCE + 2T f + 6nT DC + 4T S )/2 . 

Franklin-, also gives a performance comparison of (PPC) and 
parallel (BPC) methods in case of two procesor«. • * 



23 Pt B . worland. "Parallel Methods for the Numerical 
* Solution of Ordinary Differential Equations". IEEE , 

Transactions on Computing" , Vol. C-25 ( October ,' 1 976 ) , pp. 
1,045-1048.' 

. ■ ■« • 

• 24 M . a. Franklin. "Parallel Solution of Ordinary * 
Differential Equations". IEEE Transactions on ^Computing , ' 
..Vol. 0-27, No. 5 (May, 1978). 
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p 

Yi+U 


P \ 
fi+1 X 


c 

Yi+1 


c 



p 

Yi+2 



» c 
Vi+2 



c 
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Pigure 19 -* Parallel Scheme for BPC 



Results 

For implementation, we used the parallel predictot- 
corrector method in conjunction with the techniques 
described i% Section 3. For comparison, we also included 
the results of the ;$unge-Kutta solutions* 

The schedules for the flight simulation problem discus- 
sed in Section 3 were programmed using HEP FORTRAN* and were 
.executed 'on the % HEP parallel computer. £he computational- 
results are shown in Table 5. The sequential ■ times T-j and 
the parallel, times T p with p processors are given in, terms 
of seconds. For comparison, the times ^for the Runge-Kut'ta 
method described - in ^ection 3 are calso ^included . 



) . 



> 
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PROGRAM P T 1 , T p S P E P ■ 

RK ' 8 . 28.18 4.87 ' 5.*'78 72.3% 

•PC 8 ^21 .59 * 3.33 • 6.48 81% 

i * 

TABLE 5 - Speed-up & Efficiency; Predictor-Corrector 

and Runge-Kutta Methods 



The four-processor schedulers run in combination wi 
the parallel predictor-corrector formula given by' (I>P-C4. 
The program cheated eight instruction streams in parallel, 
four for predictor and four for corrector iteration. The 
achieved speed-up and efficiency in this case, as compared 
to the r serial program, is shown in Table 5. Since the 
Serial PC methods are expected to be more efficient than 
Serial RK methods, the difference in speed-up of their 
"parallel mode is also to be expected. On the other hand, 
the -data communication and synchronization in parallel 
predictor-corredtor is more than the method using. the RK 
formula. These calculations are done in^the following* 
analysis of the loss of the efficiencies in both programs. 



Let 



A = number of cycles requii^ed by actual computation, 
B = number of cycles required by the best schedule, 
C = number of cycles required by synchronization. 



For the eight-processor scheme with the RK method, the 
values of A, B, C^are: 

A= 1384/8 = 173 cycles 

B*= 192 cycles = 10.9% of A 

C = (78 + 2)/8 =19.5 average 

and C = 23 for worst/case = 11.9"% of -B 
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The total number Jpf cycles is then ffiven 



Cycles = A '+ (B - A) + C 

? 173 + T9 + 27. =215 




Thfe predicted solution time is gixfen by 

v P^T '« Cycles x/28,000 x .i- ■ x 1(T 6 4.816 seconds ■ 

/ ' « / " 

where the actual Solution t/me given by Table 5 is 4.87 
seconds* 



For the four 
(nd C are: 



-professor's 

1 * • 



PC method, the values of A, B, 



A- = 1384/4 = 346» 

i B = ,363 = 4.*9% of A ' 

C = (86 x 2)/4*+ 50/.8 = 55 ./5- average 

and C = 58'in worst case = 15.9% of B. •* 



This, -gives the s total 1 ' number or cycles required by the 
program n I * 



Cycles = A + {£ - A)*+ C * -\ * . • 

= 35£ +M 7 + 5 $ = 421 cycles # . 

Tftis gives' the predicted /efficiency for the PC method 

';...*/ • / ■ • . ' 

PE = 356/421 = 82%, . * 

* / ~ 

where the ..actual 'efficiency gitfen by Table 5 rs 81%, 



Mathematical, Functions * 



^ In addition to^ the .teorgartization of differential equa-\, 
tions, we have *examined^ the reorganization of two. common 
'f .motions of a mathematical library. In the case of differ- 
ential Equations, the reorganization resulted in di-ffererrt' 

& - - 
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algorithms being employed, whereas, in the cases we are % aboj^t 
to discuss the algorithms*are identical but the programs^ are 
considerably reorganized. , * - 



^Shortest Path Problem / 

f 

Shortest path problems' are among the most fundamental 
and commonly encountered problems in transportation and com- 
munication networks. We included such a problem in this 
study for three ^reasons.. First, it is directly applicable 
to flight simulation studies as a mathematical utility 
function. Second,* it is used to schedule algorithms for 
generating MIMD programs that solve ordinary differential 
equations. Finally, the techniques used \o derive paralJLel- 
^ism clearly show the limitations of automatic detectiqn of 
~ parallelism. • We elaborate this third point in Section 5. 

\ j » 
The shortest pa.th problem we Examined was thg all-tp- 

all program: given n nodes (points, vertices, etc.) and 

give/i a distance (cost) between each ordered ^>air of points, 

determine the 'minimum distance (or cost) and path between 

all pairs. of nodes. The distance or cost 'function does not 

require t,he' distance from i to j to be the Scime as the 

distance from jto'i. Further, the triangle inequality is 

not required to be satisfied. Finally, the distance values 

may be- negative so long as there are no negative cycles. 

Such a problem coald well be stated as: gj^en j^^ufmber 3|f 

locations (latitude, longitude and altitude) and giiveh the \ „ 

fuel consumption of an Aircraft between all adjacent pairs 

of points, what is trie 1 minimum fue^jpon sumption between some 

given point and any other point? ' ^ 



/'The literature contains well over 200 papers on short- 
est( path algorithms 2 ^.'' We chose Floyd's algorithm 2 **, ' 
which is, very general arid' provides the minimum path as well 
as the cost of thatVpath. The nodes of the* graph are 
represented 'by = the integer^ 1, 2,/.., n and the , path length 
(cost) is 'represented by art n-by-n matrix W where is 
*the distance , from node . i to node j. Node j must be adjacent 
, to i (otherwise W^j has the value °°). The sequential 
algorithm is show'n in Figure t 20. a For the algorithm to 
produce the paths as WeXf as the shortest distance, we, need 
a second* n-by^n matrix Z (of ten* referred to as the optimal- 
goligy matrix ") where Zj^j i3 initiated to j if W^j '¥ «°° 
and. zero otherwise. ^During execution* of the innermost loop, 
-if it is found.that Wj ^ + is less than W j ^ then 

(in addition to replacing Wj*^) the value of Aj^-is 
replaced with the current £raiue of Zj^. Upon completion 
of execution, the shortest path'from vertex a to vertex b is 
« de termined 'by the vertex" sequence: 



' v 1 ■ z a,,b 



b = 2 V i , 



I ■ - 
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* 25 N<t>£© and C. Pang. Shbrtest Path Algorithms: / 
Taxonomy arid Annotation ^ Technical Report No. CS-80-057, 
Computer. Sciepce Department, Washington^ ptate* University, 
i Pull-man, WA (March, 1980V • 

. 2b R. W. Floyd. ''Algorithm 97c % Shortest Path". 
Communications of the ACM ,' Vol. -5 J 1962) , p. 346.- \ 1 

______ ■ ... ; . ■ . ; * ^< §5. ^ 



PROGR^i MINPATH 



1 1° 



READ N,W - 

FOR I = 1 TO ^ DO 



FOR CP = 1 TO N DO 



FOR K = 1 TO H DO 



IF Wj/i + W ifk < W jfk 
THEN 



W j,k <" W j,i + W i,k 



WRITE W 



'. '-•< / 

Figure. zO - Minimum Path Algorithm 



To^ determine a parallel version of this algorithm, we . 
define the code in the mbst-int'er ior lpop, to be a task and 
denote it as T^. The algorithm requires that the execu- . 
tion of T A j k be complete ksfore starting- execution bf T uvw „ 
(denoted by ; T i: j k ^T uvw ) if ijk'precedes uvw in the' £ 
natural lexical order. We now determine the maximally 
parallel task system equivalent to the t<sk system of the 

.-Sequential program by examining the* range and dbmain 6f the 
tasks Tijfc-. We denote fW-range (memory locations that ate 

'* written .into by Tij k > by R i: j k and the domain (memory cells, 
read by Tf ijk ) by D ijk . ' ■ # M / _ 



Notje that . . s 

t 



w 



S 



V 



ft 
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For determining the range, note that trie problem requires 
that there be no -negative -weight cycles (if this toere the 
case- there would be no "minimum path for any nodes. within the 
negative cycle). , Thus, if i = $ ,then 

Wj,k < w i,j + ' w i,k 
and it 1 =1 k then 



Thus 



Wj,*: < W jfi + W ifk 



* 



R ijk = ./ w j,k if i ^ j >nd i. f k_ . 
" 1 0 otherwise 

.* 

' ■ . . . . "v- 

From, the - above rangfe- an A d domain we observe that give$ 
distinct 9 T, T 1 where - * 

T, T 1 e Si = {Tij k |.1<3<n, 1<£<n } 



then 



n R s = 0 



4, 

,and 



Consequently, fftr each value qfH^^a-11 tasks in Si may be 
executed in parallel. Thus: 

PROGRAM PARP^ATH' * * 



- for i 


<- 1 to n do . 






for all 1<j<n and 1<k<n, 
do concurrently 








S<- W-h + WjV 
if S i i then 


* 

< 






. w ij- <- S - 





is a.,correct program* that produce? the same results as the 



sequential program MINPATH. ^ The parallel version- 
processors resulting in a speed-up on n 2 cir,^ n2 



can use n 
Since n 2 proces- 
sors.may be larger th^n the number available , in typical 
"systems, we programmed this parallel algorithm^ for use,, of K 
processors where k K is 4 in the range of 1 to n. .This program 
is shown ip .Figures 21 and<*22. « ■ 

The memory. limitations of the jprqtotype HEf> limited us 
to a matrix size of 40 i 40. We ran. randomly generated test 
cases for this si'zfe usfng from- 1 to 14 prcpcesses. The 
results, are* shown in Table 6. The agreement between actual 
afid* predicted efficiency is .good / When j^he numbed of " * 
processes *(P) is a divisor of the dimension (N) of the 
matrix, the efficiencies are excellent. * * * 



Linear Equation Solver 



Solving a set of linear equations is a problem of 
"central importance . Nearly^ every mathematical library 
contains programs for its solution. One <?f the mpst' popular- ^ 
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P. 


> 




N • 


> 


CREATE .STREAM 1 ("1) ^ 
CREATE STREAM 2(2) , 
CREATE STREAM 3(3) 
CREATE STREAM 4(4) 
CREATE STREAM 5(5) 
CREATE STREAM 6(6)*' 
CREATE STREAM 7 (7) 
CALL STREAM -8 (8) 
WRITE W. 
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Figured 1 — Maih Program for Parallel Path 
Using Bight Processes 
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Figure 22 f Subroutine for Parallel Path ■ 
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Solution 'game 



Efficiency i 
Achieved 'Predicted 



1 - • 1 .3*102 ' • ' 

2-' •' .65408 1.0 * 1.0 



{ 



3, -U^- .45176* . .966 v - .952 

4 ' " .32808 / \ -998 , 1.0 



S .26258^ > ..997 . 1 ;0 ■ 

S '' , .'22749* .959p. . ' .952" 



13 ' ./oi7 • .779 .769 

' 14- ^ • * 1.7582 ' .931 .952 




7 .19565 .956 .952 

8 .16524 / . .99 . ^ .0 ^ 



9 • ' .18217 .898 * . . % .889" 

f 

10 „* - .16573 .988 " ^ 1.0 

11 .18099 . .905 .- .909 

12 * .19492 .84 . .83 



< 



TABLE 6 - Performance o£ Parallel Path Algorithm 

" . : » • - \ . 

' . • f v • . . . 
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algorithms is* LU decomposition using Gaussian elimination* y 
with some form of pivoting. We address or^ly partial 
pivoting, . Derails of this algorithm can be found' in any 
standard text on numerical analysis, such as Introduction .to 
Numerical Analysis 2 ,7. Figure 23 shows a serial program, for 
Lp decomposition,, Our method of reorganizing is to unroll 
the DO -loops, apply the techniques of Section 3, schedule 
the resulting parallel system, and finally .write a number of 
subrouting, employing DO loops wbose parallel execution is 
equivalent to the original program. 

The tasks we have selected are indicated in Figure 23. 

Tb^y consist* of the code segment that works on a particular 

Column j for a particular value^ of k. We denote those tasks" 
by » , 




{ J = {TJ 

: k 



| I < K < j < n, K < n « I ) * ^ 



4 

THe precedence constraints imposed by the sequential program 



are 



< ; = [ £ (T?, T 1 } I j<l or k<m]'. 



Thus, C = (J, <•) is the task system that represents the 

sequential program. Th'e range, and domain of these tasks 
are: 

/^r(tJ),- {A(i,j) ' I k<i<p } . ■ V 

• * ' D(T3) = <A(i,^ |-k<l<hj u [A(i,k) | k<i<n ) . \ 



• 27 Fi> B> HildebrancJ. (New York: McGraw Hill, 197^'). 



Program LUDECOMP (:A(n,n)).. 



,For k- <•* V to'n-1 do 



Firid~"j such that* 
■ 

|A(j,k)|, = max { |A(k,k) | , .. . , | A(ri,k).| }. 
PIV(k) <- -j [p^Vot row] 

A<PIV(k) ,k> <-> A(k,k} 



For i <- k+1 .to rt do 



A(i,k) <-,A(i,k)/Mk,k) [elementp<of rE] 



For j <- k+1 to n da 



A(fIV(k),j) <-> A(k,j) 
For i k+1 to n do 
A(i,j) <- A(i,-j).- A(i,k)*As(k, j) 




Prpm this we can observe "that, for example, 



T > 

U • • • 9 




are all mutually noninter fer ing tasks and co.uldbe executed 
in parallel. ^ More specifically, we observe that C = 
(T,< # ), where <• is. the transitive closure on the 
relation . 

•x = { (.t£, TjV| 4 k<j<n )u { (T^k T^ +1 ) | k<j<n } 

> 

is^ a maximally parallel system equivalent" to C. This system 
is illustrated in Figure 24. ' ' 

' % Given the task system. C 1 we now determine the execution 
time of the tasks and fr om ^ that* determine a schedule. We 
assume that one mult iply cfnd^ one subtract, r 6r one multiply 
and one c6jnpars-, constitutes a time step, ^ Thu^, neglecting 

any overhead for loop control, the execution time W(T^) for / 
each of the.taSks is given by: i . . 




W(T 



k \ = Tn+1-k if k=j' 
j' , \n-k k<j 



Treating the task system C together with ^W(T^)^ as a 
weighted graph we observe that the longest "path traverses 
the n6des : - . ? -* * , ■ \ • 

T 1 T 2 T 2 3 3 . mn-1* T n 
. . V V A 2 ' V A 3' n-T n-1 



We denote this path as S-j and the length o5 the path as 



n-1 



^(S,) = n+1 + v 2 3 = n2 - 1 " 



> 
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Since the problem fcannot bp- solved' in a time shorter than 
this path length* we developed a schedule where the t^sks 
constituting S 1 are assigned to processor 1 and the remain- 
ing tasks are assigned to [n/2] - 1 additional processors, 
' Processor 2 execute^ the tasks 

* * 3 4 - 4 5 ' 5 n ' 

T 1f T 1f T 2 , Tg, T 3 , T n « 2 



More generally, processor j executes the tasks 



V 

2j-1 . 2j 2j 2j + 1 n 
«V , T/ f T 2 , T 2 , T n _ 2 *(j-1) 




We denote tnis as S-;. Note that this is not a path through 
the graph. .Fot the case where n is even, this schedule is 
illustrated in Figure 25. Sincesthis schedule has length 4 
n 2 - 1, the length of the longest^ path, then this schedule 
isr optimal . for n/2 processors. Using this schedule we note 

i ! «... 

that: 



fa * * 

' / " lim ri 3 /3 + *0(n' 2 ) _ 2 ' ' * 

™ S p/P " n-> » (n 2 -1) n/2* 3 * *• A 7 

■ ^ * * * * 

! and this efficiency is achieved to within 2% for relatively 
small n (n >. 50")'. . ^ * - 

* tfhese schedules wer#^pjrq>g rammed using*HEP FORTRAN," and- g 

were tun, on .the HE§P parallel computer/ AVthougfi* the program * . s * ^* 
solved' a se"t of. linear e^ta^ti-on^,, ^we recorded timing for^ 
*only the LU decomposition so^thaV it? could be .compared „ with 
Vjhe predicted solutfbn 1?£raejs. ^FabJ.e*7 jgiv£s;.the actual, and 
pxe4^6ted efficiencies '£qr jthes* nuttfblr^of equations "ranging , 
. "from 10 XO 35 and the numberVof' parallel instruction streams ^ 

ranging from 2'to 8. » , . * - — V \ 4 ^ * 4 



♦ % A T V**^ . ■ 



4 * . • 
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r 



no. of 
equa- 
tions 



10 



A .833 
P .852 



15 



20 



A .921 
P .931 



25 



30 



A .942 
P .949 



.948 



35 




.719 .642 
.739 .678 



.633 
.685 



A .888 .79-4, .740 ,65<1- .618 ..625 
P .900 .815 .766. .679 .652 /681 



.843 



.863,. 



.774 
.798 



.758 
.789 



.6/70 

i 

.703 



.623 
.656 



A .934 .878 .\830 .763 .755* ' .692 
-P .944 • .8-96 .855 ' .739 .788 .726 



..,892 .844 .818 ".757 .744 
.91 1 .863 .843 .733 ■ .7*77 



.901 



t .956 .918 



.•862 
.880 



.819 
.843 



.790 
.627 



.747 
.779 



A = Actual efficiency. 
P = Predicted efficiency. 




.741 




t 



Table 7 - Efficiency of LO Decomposition j 
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SECTION 5: CONCLUSIONS 

In this study'we have examined programs that are all in 
support of flight simulation , but 'which can also be' categor- 
ized by the -types of mathematical functions or services that 
they supply. Categorized in thi^jrmnner they are: 

(1) numerical approximation of elementary functions, 

(2) solution of linear - algebraic equations, 

(3) solution of shortest path problems on graphs, and 

(4) solution of ordinary differential equations. 

For problems in the first category/' we examined the 
program at an arithmetic instruction level and produced 
parallel code based on this examination. These techniques 
produced speed-ups in the range ^of two to three. Since^N 
elementary function approximation involves small amounts of 
computation, these very modest' speecl-ups are perhaps to be 
expected. In producing parallel code for these functions, 
we were guided by the formal work in the area of polynomial 
evaluation. We. do not forese«^any automated approach to 
producing a. library of elementary functions for a particular- 
MIMD computer, but this does not adversely affect the 
potential of MIMD computing. Historically, elemen-tar,y func- 
tion libraries have been codjy^ln machine language and high- 
ly tailored for the target machine. ■ 

To solve linear algebraic equations, we unrolled the DO 
loops, represented the computation as a task system, and * ^ 
'Irom this produced a number of DO loops' that could be exe- 
cuted in parallel. As we mentioned in Section 2, automatic, 
detection of parallelism within nested DO Ipops is receiving * 
considerable research interest. We belie^^that, if future 
Air Force simulation requirements include flexible* tody' 
representations (generally requiring solution of linear 
algebraic aquations), either the algorithms developed here 
will be of benefit or automatic recognizers of parallelism 
within nested DO. loops will be available. The speed-up 
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r 



' avail-able in this problem type-, is bounded only by the size 
of the problem. * For example, given a set of 100 linear 
equations, our algorithm^ solves them approximately 35 times 
faster. than a sequential equivalent, and with very good 
efficiency. <■ v 

i 

Producing, a*parallel version of the shortest path pro- 
gram involved, unrolling the DO loops and treating the 

r t 

resulting code* as a task system. To determine the preced- 
ence relations, however, we used> information from both the 
code *of the "program and knowledge of the input data sets for' 
which this program was correct. Thus it is difficult to see 
how automatic detection of patallelism within DO loops could 
have produced the same parallels we did. But this should 
not detract fi^om the benefits of MIMD computing. In most 
flight siflHriralfJ&n programs, minimum and maximum path algor- 
ithms are usually utility routMn'es; their status as library 
function should be adequate was the case for the linear 
equations', - parallelism and speed-up is bounded only t^the 
si^e of the problem. Bor example, given a shortest or long- 
est path problem involving 100 points, a speed-up of 100 
over an equivalent sequential program is achievable with 
efficiencies near 100%. 

For solving ordinary differential equations, a variety 
of techniques were investigated. Those described in Section 
3 seemed most successful. Two distinct flight simulation 
programs. were ^examined using these techniques, ¥he ground- 
launched missile and the aerodynamics portion of the A-10 
flight stimulation. The A-10 aerodynamics is approximately 
twice as \tuch!code as the missive simulation (2803 machine 
instr uctioh^sVversus 1384) and presumably represents, the 
apptoximate fidelity of simulations cur^entl^y used for 
training purposes. On these assumptions we conclude that a 
MIMD computer* of the power^ of HEP could pe used in Air Force 
flight simulation projects in two ways 

(1) as" a multiprogramming computer capable of running 
several concurrent si-mu^at ions Qf the fidelity of 
•the* A-10 simulation, oi 
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(2) as a compute^ capable of running one or two con- 
current flight simulations of significantly more ft 
fidelity ^than the current A-10 programs. 

, j 

Greater fidelity is poasible nc*t only in the aerodynamic 
section but also in commuting, visual cu$s for the trainee. 

. .Should an- MIMD computer of significantly less power 
than HEP be employed for flight simulation, our study 
indicates that there is adequate parallel ism^within these 
types of, problems that lesser computing power would b^e 
adequate for single simulation programs. 
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